Computationally and numerically scalable algorithms are needed to exploit emerging
parallel-computing capabilities. In this work, pressure-based algorithms which
solve  the  two-dimensional  incompressible  Navier-Stokes  equations  are  developed  for 
single-instruction  stream/multiple-data  stream  (SIMD)  computers. 

The  implications  of  the  continuity  constraint  for  the  proper  numerical  treatment 
of  open  boundary  problems  are  investigated.  Mass  must  be  conserved  globally  so  that 
the  system  of  linear  algebraic  pressure-correction  equations  is  numerically  consistent. 
The  convergence  rate  is  poor  unless  global  mass  conservation  is  enforced  explicitly. 
Using  an  additive-correction  technique  to  restore  global  mass  conservation,  flows 
which  have  recirculating  zones  across  the  open  boundary  can  be  simulated. 

The performance of the single-grid algorithm is assessed on three massively-parallel
computers: MasPar's MP-1 and Thinking Machines' CM-2 and CM-5. Parallel
efficiencies approaching 0.8 are possible, with speeds exceeding that of traditional
vector supercomputers. The following issues relevant to the variation of parallel
efficiency with problem size are studied: the suitability of the algorithm for SIMD
computation; the implementation of boundary conditions to avoid idle processors;
the choice of point versus line-iterative relaxation schemes; the relative costs of the
coefficient  computations  and  solving  operations,  and  the  variation  of  these  costs  with 
problem  size;  the  effect  of  the  data-array-to-processor  mapping;  and  the  relative 
speeds  of  computation  and  communication  of  the  computer. 

A  nonlinear  pressure-correction  multigrid  algorithm  which  has  better  convergence 
rate  characteristics  than  the  single-grid  method  is  formulated  and  implemented  on 
the  CM-5.  On  the  CM-5,  the  components  of  the  multigrid  algorithm  are  tested  over  a 
range  of  problem  sizes.  The  smoothing  step  is  the  dominant  cost.  Pressure-correction 
methods  and  the  locally-coupled  explicit  method  are  equally  efficient  on  the  CM-5. 
V  cycling  is  found  to  be  much  cheaper  than  W  cycling,  and  a  truncation-error  based 
"full-multigrid"  procedure  is  found  to  be  a  computationally  efficient  and  convenient 
method  for  obtaining  the  initial  fine-grid  guess.  The  findings  presented  enable  further 
development  of  efficient,  scalable  pressure-based  parallel  computing  algorithms. 
CHAPTER 1
INTRODUCTION 


1.1     Motivations 


Computational  fluid  dynamics  (CFD)  is  a  growing  field  which  brings  together 
high-performance computing, physical science, and engineering technology. The
distinctions between CFD and other fields such as computational physics and
computational chemistry are largely semantic now, because more and more interdisciplinary
applications are coming within range of the computational capabilities. CFD
algorithms and techniques are mature enough that the focus of research is expected to
shift  in  the  next  decade  toward  the  development  of  robust  flow  codes,  and  toward  the 
application  of  these  codes  to  numerical  simulations  which  do  not  idealize  either  the 
physics  or  the  geometry  and  which  take  full  account  of  the  coupling  between  fluid 
dynamics  and  other  areas  of  physics  [65] .  These  applications  will  require  formidable 
resources, particularly in the areas of computing speed, memory, storage, and
input/output bandwidth [78].

At  the  present  time,  the  computational  demands  of  the  applications  are  still 
at  least  two  orders-of-magnitude  beyond  the  computing  technology.  For  example, 
NASA's grand challenges for the 1990s are to achieve the capability to simulate
viscous, compressible flows with two-equation turbulence modelling over entire aircraft
configurations, and to couple the fluid dynamics simulation with the propulsion and
aircraft control systems modelling. To meet this challenge it is estimated that 1
teraflops computing speed and 50 gigawords of memory will be required [24]. Current
massively-parallel supercomputers, for example, the CM-5 manufactured by Thinking
Machines, have peak speeds of O(10 gigaflops) and memories of O(1 gigaword).

Optimism  is  sometimes  circulated  that  teraflop  computers  may  be  expected  by 
1995  [68].  In  view  of  the  two  orders-of-magnitude  disparity  between  the  speed  of 
present-generation  parallel  computers  and  teraflops,  such  optimism  should  be  dimmed 
somewhat.  Expectations  are  not  being  met  in  part  because  the  applications,  which 
are  the  driving  force  behind  the  progress  in  hardware,  have  been  slow  to  develop.  The 
numerical algorithms which have seen two decades of development on traditional
vector supercomputers are not always easy targets for efficient parallel implementation.
Better understanding of the basic concepts and more experience with the present
generation of parallel computers are prerequisites for improved algorithms and
implementations.

The  motivation  of  the  present  work  has  been  the  opportunity  to  investigate  issues 
related  to  the  use  of  parallel  computers  in  CFD,  with  the  hope  that  the  knowledge 
gained  can  assist  the  transition  to  the  new  computing  technology.  The  context  of  the 
research  is  the  numerical  solution  of  the  2-d  incompressible  Navier-Stokes  equations, 
by a popular and proven numerical method known as the pressure-correction
technique. A specific objective emerged as the research progressed, namely to develop
and analyze the performance of pressure-correction methods on the single-instruction
stream/multiple-data stream (SIMD) type of parallel computer. Single-grid
computations were studied first, then a multigrid method was developed and tested.

SIMD computers were chosen because they are easier to program than
multiple-instruction stream/multiple-data stream (MIMD) computers (explicit message-passing
is not required), because synchronization of the processors is not an issue, and
because the factors affecting the parallel run time and computational efficiency are
easier to identify and quantify. Also, these are arguably the most powerful machines
available right now: Los Alamos National Laboratory has a 1024-node CM-5 with 32
Gbytes of processor memory and a peak speed of 32 Gflops. Thus, the code,
the  numerical  techniques,  and  the  understanding  which  are  the  contribution  of  this 
research  can  be  immediately  useful  for  applications  on  massively  parallel  computers. 

1.2     Governing  Equations 

The governing equations for 2-d, constant property, time-dependent viscous
incompressible flow are the Navier-Stokes equations. They express the principles of
conservation of mass and momentum. In primitive variables and cartesian
coordinates, they may be written

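$$\frac{\partial u}{\partial x} + \frac{\partial v}{\partial y} = 0 \tag{1.1}$$

$$\frac{\partial \rho u}{\partial t} + \frac{\partial \rho u^2}{\partial x} + \frac{\partial \rho u v}{\partial y} = -\frac{\partial p}{\partial x} + \mu\frac{\partial^2 u}{\partial x^2} + \mu\frac{\partial^2 u}{\partial y^2} \tag{1.2}$$

$$\frac{\partial \rho v}{\partial t} + \frac{\partial \rho u v}{\partial x} + \frac{\partial \rho v^2}{\partial y} = -\frac{\partial p}{\partial y} + \mu\frac{\partial^2 v}{\partial x^2} + \mu\frac{\partial^2 v}{\partial y^2} \tag{1.3}$$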

where u and v are cartesian velocity components, ρ is the density, μ is the fluid's
molecular viscosity, and p is the pressure. Eq. 1.1 is the mass continuity equation, also
known as the divergence-free constraint since its coordinate-free form is div u = 0.

The Navier-Stokes equations 1.1-1.3 are a coupled set of nonlinear partial
differential equations of mixed elliptic/parabolic type. Mathematically, they differ from
the compressible Navier-Stokes equations in two important respects that lead to
difficulties for devising numerical solution techniques.

First, the role of the continuity equation is different in incompressible flow.
Instead of a time-dependent equation for the density, in incompressible fluids the
continuity equation is a constraint on the admissible velocity solutions. Numerical
methods must be able to integrate the momentum equations forward in time while
simultaneously maintaining satisfaction of the continuity constraint. On the other hand,
numerical methods for compressible flows can take advantage of the fact that in the
unsteady form each equation has a time-dependent term. The equations are cast
in vector form, and any suitable method for time-integration can be employed on the
system of equations as a whole.

The second problem, assuming that a primitive-variable formulation is desired, is
that there is no equation for pressure. For compressible flows, the pressure can be
determined from the equation of state of the fluid. For incompressible flow, an auxiliary
"pressure-Poisson"  equation  can  be  derived  by  taking  the  divergence  of  the  vector 
form  of  the  momentum  equations;  the  continuity  equation  is  invoked  to  eliminate 
the  unsteady  term  in  the  result.  The  formulation  of  the  pressure-Poisson  equation 
requires  manipulating  the  discrete  forms  of  the  momentum  and  continuity  equations. 
A particular discretization of the Laplacian operator is therefore implied in the
pressure-Poisson equation, depending on the discrete gradient and divergence operators. This
operator  may  not  be  implementable  at  boundaries,  and  solvability  constraints  can 
be  violated  [30].  Also,  the  differentiation  of  the  governing  equations  introduces  the 
need  for  additional  unphysical  boundary  conditions  on  the  pressure.  Physically,  the 
pressure  in  incompressible  flow  is  only  defined  relative  to  an  (arbitrary)  constant. 
Thus,  the  correct  boundary  conditions  are  Neumann.  However,  if  the  problem  has 
an  open  boundary,  the  governing  equations  should  be  supplemented  with  a  boundary 
condition  on  the  normal  traction  [29,  32], 

$$F_n = -p + \frac{1}{Re}\frac{\partial u_n}{\partial n}, \tag{1.4}$$

where F is the force, Re is the Reynolds number, and the subscript n indicates the
normal direction. However, F_n may be difficult to prescribe.
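
The remark that the pressure is defined only up to a constant, and the associated solvability constraint, can be checked on a small discrete example. The following is a minimal sketch under stated assumptions, not the discretization used in this work: it builds a 1-d cell-centered Laplacian with zero-gradient (Neumann) ends. The name `neumann_laplacian`, the grid size, and the right-hand sides are hypothetical. Subtracting the mean of the right-hand side is used here only as a simple stand-in for a correction that restores the zero-sum condition.

```python
import numpy as np

def neumann_laplacian(n):
    """1-d cell-centered Laplacian (unit spacing) with zero-gradient ends."""
    A = np.zeros((n, n))
    for i in range(n):
        if i > 0:
            A[i, i - 1] = 1.0
            A[i, i] -= 1.0
        if i < n - 1:
            A[i, i + 1] = 1.0
            A[i, i] -= 1.0
    return A

n = 8
A = neumann_laplacian(n)
ones = np.ones(n)

# Constants lie in the null space: the solution is fixed only up to a constant.
null_residual = np.abs(A @ ones).max()

# A is symmetric, so A p = b is solvable only if b is orthogonal to the
# constant vector, i.e. only if sum(b) == 0.
b_bad = np.arange(n, dtype=float)   # sum != 0: inconsistent system
b_ok = b_bad - b_bad.mean()         # shifted so that sum == 0

p_bad = np.linalg.lstsq(A, b_bad, rcond=None)[0]
p_ok = np.linalg.lstsq(A, b_ok, rcond=None)[0]
res_bad = np.abs(A @ p_bad - b_bad).max()   # stays O(1): no exact solution
res_ok = np.abs(A @ p_ok - b_ok).max()      # round-off level
```

In this toy setting the zero-sum condition on the right-hand side plays the role that global mass conservation plays for the pressure-correction equations.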


In practice, a zero-gradient or linear extrapolation for the normal velocity
component is a more popular outflow boundary condition. Many outflow boundary
conditions have been analyzed theoretically for incompressible flow (see [30, 31, 38, 56]).
There are even more boundary condition procedures in use. The method used and its
impact on the "solvability" of the resulting numerical systems of equations depend
on the discretization and the numerical method. This issue is treated in Chapter 2.

1.3     Numerical  Methods  for  Viscous  Incompressible  Flow 

Numerical algorithms for solving the incompressible Navier-Stokes system of
equations were first developed by Harlow and Welch [39] and Chorin [15, 16]. Descendants
of these approaches are popular today. Harlow and Welch introduced the important
contribution of the staggered-grid location of the dependent variables. On a
staggered grid, the discrete Laplacian appearing in the derivation of the pressure-Poisson
equation has the standard five-point stencil. On colocated grids it still has a
five-point form but, if the central point is located at (i,j), the other points which are
involved are located at (i+2,j), (i-2,j), (i,j+2), and (i,j-2). Without nearest-neighbor
linkages, two uncoupled ("checkerboard") pressure fields can develop independently.
This pressure-decoupling can cause stability problems, since nonphysical
discontinuities in the pressure may develop [50]. In the present work, the velocity components
are  staggered  one-half  of  a  control  volume  to  the  west  and  south  of  the  pressure  which 
is  defined  at  the  center  of  the  control  volume  as  shown  in  Figure  1.1.  Figure  1.1  also 
shows  the  locations  of  all  boundary  velocity  components  involved  in  the  discretization 
and  numerical  solution,  and  representative  boundary  control  volumes  for  u,  v,  and  p. 
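
The decoupling on colocated grids can be illustrated numerically. The sketch below is an illustration under simplifying assumptions (a periodic grid with unit spacing and central differences). It is not the discretization of the present work, and the function names are hypothetical. A checkerboard pressure field is invisible to both the central-difference gradient and the wide five-point stencil, whereas the compact stencil with nearest-neighbor linkages detects it.

```python
import numpy as np

n = 8  # even number of points per direction, periodic grid, unit spacing
i, j = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
p = (-1.0) ** (i + j)  # checkerboard pressure field

def wide_laplacian(f):
    """Stencil using (i+2,j), (i-2,j), (i,j+2), (i,j-2): the divergence of a
    central-difference gradient on a colocated grid."""
    return (np.roll(f, 2, 0) + np.roll(f, -2, 0)
            + np.roll(f, 2, 1) + np.roll(f, -2, 1) - 4.0 * f) / 4.0

def compact_laplacian(f):
    """Standard five-point stencil with nearest-neighbor linkages."""
    return (np.roll(f, 1, 0) + np.roll(f, -1, 0)
            + np.roll(f, 1, 1) + np.roll(f, -1, 1) - 4.0 * f)

# Central-difference pressure gradient in x on the colocated grid.
dpdx = (np.roll(p, -1, 0) - np.roll(p, 1, 0)) / 2.0

grad_max = np.abs(dpdx).max()                    # 0: momentum sees no gradient
wide_max = np.abs(wide_laplacian(p)).max()       # 0: oscillation is undetected
compact_max = np.abs(compact_laplacian(p)).max() # 8: oscillation is detected
```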

In  Chorin's  artificial  compressibility  approach  [15]  a  time-derivative  of  pressure  is 
added  to  the  continuity  equation.  In  this  manner  the  continuity  equation  becomes 
an equation for the pressure, and all the equations can be integrated forward in time,
either as a system or one at a time. The artificial compressibility method is closely
related  to  the  penalty  formulation  used  in  finite-element  methods  [41].  The  equations 
are  solved  simultaneously  in  finite-element  formulations.  Penalty  methods  and  the 
artificial  compressibility  approach  suffer  from  ill-conditioning  when  the  equations 
have  strong  nonlinearities  or  source  terms.  Because  the  pressure  term  is  artificial, 
they  are  not  time-accurate  either. 

Projection  methods  [16,  62]  are  two-step  procedures  which  first  obtain  a  velocity 
field  by  integrating  the  momentum  equations,  and  then  project  this  vector  field  into 
a divergence-free space by subtracting the gradient of the pressure. The
pressure-Poisson equation is solved to obtain the pressure. The solution must be obtained
to a high degree of accuracy in unsteady calculations in order to obtain the correct
long-term behavior [76], so every step may be fairly expensive. Furthermore,
the time-step size is limited by stability considerations, depending on the implicitness
of the treatment used for the convection terms.
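
The projection step itself can be sketched compactly. The example below is a minimal illustration under strong assumptions: a doubly periodic square domain and a spectral (FFT) Poisson solve, neither of which is used in the present work. The functions `project` and `divergence` are hypothetical names. Only the second step of a projection method is shown, that is, removing the gradient of a scalar potential so that the velocity field satisfies continuity.

```python
import numpy as np

def project(u, v, dx):
    """Return the divergence-free part of a periodic velocity field (u, v)."""
    n = u.shape[0]
    k = 2.0 * np.pi * np.fft.fftfreq(n, d=dx)   # angular wavenumbers
    kx, ky = np.meshgrid(k, k, indexing="ij")
    k2 = kx**2 + ky**2
    k2[0, 0] = 1.0                               # mean mode: avoid 0/0

    u_hat, v_hat = np.fft.fft2(u), np.fft.fft2(v)
    div_hat = 1j * kx * u_hat + 1j * ky * v_hat  # divergence of (u, v)
    phi_hat = -div_hat / k2                      # solve lap(phi) = div
    phi_hat[0, 0] = 0.0                          # phi is defined up to a constant

    # Subtract grad(phi); the result satisfies the continuity constraint.
    u_new = np.fft.ifft2(u_hat - 1j * kx * phi_hat).real
    v_new = np.fft.ifft2(v_hat - 1j * ky * phi_hat).real
    return u_new, v_new

def divergence(u, v, dx):
    """Spectral divergence, used here only to check the result."""
    n = u.shape[0]
    k = 2.0 * np.pi * np.fft.fftfreq(n, d=dx)
    kx, ky = np.meshgrid(k, k, indexing="ij")
    return np.fft.ifft2(1j * kx * np.fft.fft2(u) + 1j * ky * np.fft.fft2(v)).real

# Smooth test field with nonzero divergence on a 32 x 32 periodic grid.
n = 32
dx = 2.0 * np.pi / n
x = np.arange(n) * dx
X, Y = np.meshgrid(x, x, indexing="ij")
u0, v0 = np.sin(X), np.sin(Y) + np.cos(X)
u1, v1 = project(u0, v0, dx)
div_before = np.abs(divergence(u0, v0, dx)).max()
div_after = np.abs(divergence(u1, v1, dx)).max()
```

For this test field the gradient part is (sin x, sin y), so the projected field should reduce to (0, cos x) up to round-off.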

"Pressure-based"  methods  for  the  incompressible  Navier-Stokes  equations  include 
SIMPLE  [61]  and  its  variants,  SIMPLEC  [19],  SIMPLER  [60],  and  PISO  [43].  These 
methods  are  similar  to  projection  methods  in  the  sense  that  a  non-mass-conserving 
velocity  field  is  computed  first,  and  then  corrected  to  satisfy  continuity.  However,  they 
are  not  implicit  in  two  steps  because  the  nonlinear  convection  terms  are  linearized 
explicitly.  Instead  of  a  pressure-Poisson  equation,  an  approximate  equation  for  the 
pressure  or  pressure-correction  is  derived  by  manipulating  the  discrete  forms  of  the 
momentum  and  continuity  equations.  A  few  iterations  of  a  suitable  relaxation  method 
are  used  to  obtain  a  partial  solution  to  the  system  of  correction  equations,  and 
then  new  guesses  for  pressure  and  velocity  are  obtained  by  adding  the  corrections 
to  the  old  values.  This  process  is  iterated  until  all  three  equations  are  satisfied. 
The iterations require underrelaxation because of the sequential coupling between
variables. Compared to projection methods, pressure-based methods are less implicit
when  used  for  time-dependent  problems.  However,  they  can  be  used  to  seek  the 
steady-state  directly  if  desired. 

Compared  to  a  fully  coupled  strategy,  the  sequential  pressure-based  approach 
typically has slower convergence and less robustness with respect to Reynolds
number. However, the sequential approach has the important advantage that additional
complexities, for example, chemical reaction, can be easily accommodated by simply
adding species-balance equations to the stack. The overall run time increases since
each governing equation is solved independently, and the total storage requirements
scale linearly with the number of equations solved. On the other hand, the computer
time and storage requirements escalate faster in a fully coupled solution strategy. The
typical way around this problem is to solve simultaneously the continuity and
momentum equations, then solve any additional equations in a sequential fashion. Without
knowing beforehand that the pressure-velocity coupling is the strongest among all the
various  flow  variables,  however,  the  extra  computational  effort  spent  in  simultaneous 
solution  of  these  equations  is  unwarranted. 

There are other approaches for solving the incompressible Navier-Stokes
equations, notably methods based on vorticity-streamfunction (ω-ψ) or velocity-vorticity
(u-ω) formulations, but pressure-based methods are easier, especially with regard to
boundary conditions and possible extension to 3-d domains. Furthermore, they have
demonstrated  considerable  robustness  in  computing  incompressible  flows.  A  broad 
range  of  applications  of  pressure-based  methods  is  demonstrated  in  [73]. 

1.4     Parallel  Computing 

General background of parallel computers and their application to the
numerical solution of partial differential equations is given in Hockney and Jesshope [40]
and Ortega and Voigt [58]. Fischer and Patera [23] gave a recent review of parallel
computing from the perspective of the fluid dynamics community. Their "indirect
cost," the parallel run time, is of primary interest here. The "direct cost" of parallel
computers and their components is another matter entirely. For the iteration-based
numerical methods developed here, the parallel run time is the cost per iteration
multiplied by the number of iterations. The latter is affected by the characteristics of
the particular parallel computer used and the algorithms and implementations
employed. Parallel computers come in all shapes and sizes, and it is becoming virtually
impossible  to  give  a  thorough  taxonomy.  The  background  given  here  is  limited  to  a 
description  of  the  type  of  computer  used  in  this  work. 

1.4.1     Data-Parallelism and SIMD Computers

Single-instruction stream/multiple-data stream (SIMD) computers include the
connection machines manufactured by the Thinking Machines Corporation, the CM
and CM-2, and the MP-1, MP-2, and MP-3 computers produced by the MasPar
Corporation. These are massively-parallel machines consisting of a front-end computer
and many processor/memory pairs, figuratively, the "back-end." The back-end
processors are connected to each other by a "data network." The topology of the data
network  is  a  major  feature  of  distributed-memory  parallel  computers. 

The  schematic  in  Figure  1.2  gives  the  general  idea  of  the  SIMD  layout.  The 
program executes on the serial front-end computer. The front-end triggers the
synchronous execution of the "back-end" processors by sending "code blocks"
simultaneously to all processors. Actually, the code blocks are sent to an intermediate
"control processor(s)." The control processor broadcasts the instructions contained
in the code block, one at a time, to the computing processors. These
"front-end-to-processor" communications take time. This time is an overhead cost not present
when  the  program  runs  on  a  serial  computer. 

The  operands  of  the  instructions,  the  data,  are  distributed  among  the  processors' 
memories.  Each  processor  operates  on  its  own  locally-stored  data.  The  "data"  in 
grid-based  numerical  methods  are  the  arrays,  2-d  in  this  case,  of  dependent  variables, 
geometric  quantities,  and  equation  coefficients.  Because  there  are  usually  plenty 
of  grid  points  and  the  same  governing  equations  apply  at  each  point,  most  CFD 
algorithms  contain  many  operations  to  be  performed  at  every  grid  point.  Thus  this 
"data-parallel"  approach  is  very  natural  to  most  CFD  algorithms. 

Many operations may be done independently on each grid point, but there is
coupling between grid points in physically-derived problems. The data network enters
the picture when an instruction involves another processor's data. Such
"interprocessor" communication is another overhead cost of solving the problem on a parallel
computer. For a given algorithm, the amount of interprocessor communication
depends on the "data mapping," which refers to the partitioning of the arrays and the
assignment  of  these  "subgrids"  to  processors.  For  a  given  machine,  the  speed  of  the 
interprocessor  communication  depends  on  the  pattern  of  communication  (random  or 
regular)  and  the  distance  between  the  processors  (far  away  or  nearest-neighbor). 
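
The influence of the data mapping on the amount of interprocessor communication can be estimated with a simple count. The sketch below is illustrative only. It assumes a five-point stencil, one value exchanged per subgrid-boundary point per sweep, and array dimensions that divide evenly among the processors. All function names are hypothetical. The communication is proportional to the subgrid perimeter while the computation is proportional to the subgrid area.

```python
def subgrid_shape(nx, ny, px, py):
    """Points per processor when an nx-by-ny array is split over px-by-py processors."""
    if nx % px or ny % py:
        raise ValueError("array dimensions must divide evenly in this sketch")
    return nx // px, ny // py

def halo_points(sx, sy, px, py):
    """Values fetched from neighboring processors per five-point-stencil sweep."""
    east_west = 2 * sy if px > 1 else 0
    north_south = 2 * sx if py > 1 else 0
    return east_west + north_south

def comm_to_comp_ratio(nx, ny, px, py):
    """Boundary values exchanged per grid point updated, for one processor."""
    sx, sy = subgrid_shape(nx, ny, px, py)
    return halo_points(sx, sy, px, py) / (sx * sy)

# 1024 x 1024 grid on 64 processors: square blocks versus strips.
block = comm_to_comp_ratio(1024, 1024, 8, 8)    # 128 x 128 subgrids
strip = comm_to_comp_ratio(1024, 1024, 64, 1)   # 16 x 1024 subgrids
```

Under these assumptions the square-block mapping exchanges one quarter as many values per grid point as the strip mapping, which is one reason the choice of mapping matters.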

The  run  time  of  a  parallel  program  depends  first  on  the  amount  of  front-end  and 
parallel  computation  in  the  algorithm,  and  the  speeds  of  the  front-end  and  back- 
end  for  doing  these  computations.  In  the  programs  developed  here,  the  front-end 
computations  are  mainly  the  program  control  statements  (IF  blocks,  DO  loops,  etc.). 
The  front-end  work  is  not  sped  up  by  parallel  processing.  The  parallel  computations 
are the useful work, and by design one hopes to have enough parallel computation
to amortize both the front-end computation and the interprocessor and
front-end-to-processor communication, which are the other factors that contribute to the parallel
run  time. 

From this brief description it should be clear that SIMD computers have four
characteristic speeds: the computation speed of the processors, the communication speed
between processors, the speed of the front-end-to-processor communication (i.e.,
the speed at which code blocks are transferred), and the speed of the front-end. These
machine characteristics are not under the control of the programmer. However, the
amount of computation and communication a program contains is determined by the
programmer because it depends on the algorithm selected and the algorithm's
implementation (the choice of the data mapping, for example). Thus, the key to obtaining
good performance from SIMD computers is to pick a suitable algorithm, "matched"
in a sense to the architecture, and to develop an implementation which minimizes
and localizes the interprocessor communication. Then, if there is enough parallel
computation to amortize the serial content of the program and the communication
overheads, the speedup obtained will be nearly the number of processors. The actual
performance, because it depends on the computer, the algorithm, and the
implementation, must be determined by numerical experiment on a program-by-program
basis.
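
The four characteristic speeds suggest a simple additive model of the run time per iteration. The sketch below is a hypothetical model for illustration, not a measured characterization of any machine discussed here. The class `Machine`, its field names, the operation counts, and all numerical values are assumptions. It shows only how larger subgrids amortize the fixed front-end and code-block costs.

```python
from dataclasses import dataclass
import math

@dataclass
class Machine:
    """Four characteristic times; all values supplied by the user are hypothetical."""
    t_comp: float    # seconds per arithmetic operation on one processor
    t_comm: float    # seconds per value moved between processors
    t_block: float   # seconds to dispatch one code block from the front-end
    t_front: float   # seconds per front-end (serial) operation

def parallel_run_time(m, ops_per_proc, values_exchanged, code_blocks, front_end_ops):
    """Sum of the four contributions to the run time of one iteration."""
    return (m.t_comp * ops_per_proc
            + m.t_comm * values_exchanged
            + m.t_block * code_blocks
            + m.t_front * front_end_ops)

def useful_fraction(m, ops_per_proc, values_exchanged, code_blocks, front_end_ops):
    """Share of the run time spent on parallel computation (the useful work)."""
    total = parallel_run_time(m, ops_per_proc, values_exchanged,
                              code_blocks, front_end_ops)
    return m.t_comp * ops_per_proc / total

def fraction_for_subgrid(m, points):
    """Assumed counts: 50 operations per point, 4*sqrt(points) boundary values,
    20 code blocks and 100 front-end operations per iteration."""
    return useful_fraction(m, 50 * points, 4 * math.sqrt(points), 20, 100)

machine = Machine(t_comp=1e-6, t_comm=1e-5, t_block=1e-4, t_front=1e-6)
frac_small = fraction_for_subgrid(machine, 64)     # 8 x 8 points per processor
frac_large = fraction_for_subgrid(machine, 4096)   # 64 x 64 points per processor
```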

SIMD  computers  are  restricted  to  exploiting  data-parallelism,  as  opposed  to  the 
parallelism of the tasks in an algorithm. The task-parallel approach is more
commonly used, for example, on the Cray C90 supercomputer. Multiple-instruction
stream/multiple-data stream (MIMD) computers, on the other hand, are composed of
more-or-less autonomous processor/memory pairs. Examples include the Intel series
of machines (iPSC/2, iPSC/860, and Paragon), workstation clusters, and the
connection machine CM-5. However, in CFD, the data-parallel approach is the prevalent
one even on MIMD computers. The front-end/back-end programming paradigm is
implemented by selecting one processor to initiate programs on the other processors,
accumulate global results, and enforce synchronization when necessary, a strategy
called single-program-multiple-data (SPMD) [23]. The CM-5 has a special "control
network" to provide automatic synchronization of the processors' execution, so a
SIMD  programming  model  can  be  supported  as  well  as  MIMD.  SIMD  is  the  manner 
in  which  the  CM-5  has  been  used  in  the  present  work.  The  advantage  to  using  the 
CM-5  in  the  SIMD  mode  is  that  the  programmer  does  not  have  to  explicitly  specify 
message-passing.  This  simplification  saves  effort  and  increases  the  effective  speed  of 
communication  because  certain  time-consuming  protocols  for  the  data  transfer  can 
be  eliminated. 

1.4.2     Algorithms  and  Performance 

The  previous  subsection  discussed  data-parallelism  and  SIMD  computers,  i.e. 
what  parallel  computing  means  in  the  present  context  and  how  it  is  carried  out 
by  SIMD-type  computers.  To  develop  programs  for  SIMD  computers  requires  one 
to  recognize  that  unlike  serial  computers,  parallel  computers  are  not  black  boxes.  In 
addition to the selection of an algorithm with ample data-parallelism, consideration
must  be  given  to  the  implementation  of  the  algorithm  in  specific  ways  in  order  to 
achieve  the  desired  benefits  (speedups  over  serial  computations). 

The  success  of  the  choice  of  algorithm  and  the  implementation  on  a  particular 
computer is judged by the "speedup" (S) and "efficiency" (E) of the program. The
communications mentioned above, front-end-to-processor and interprocessor, are
essentially overhead costs associated with the SIMD computational model. They would
not be present if the algorithm were implemented on a serial computer, or if such
communications were infinitely fast. If the overhead cost were zero, a parallel program
executing on n_p processors would run n_p times faster than on a single processor, a
speedup of n_p. This idealized case would also have a parallel efficiency of 1. The
parallel  efficiency  E  measures  the  actual  speedup  in  comparison  with  the  ideal. 
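
These two measures can be written out explicitly. The following sketch assumes the usual definitions consistent with the description above, S = T_1 / T_p and E = S / n_p, where T_1 is the run time on a single processor. The timings are invented for illustration and the function names are hypothetical.

```python
def speedup(t_one, t_p):
    """S = T_1 / T_p: how many times faster the parallel run is."""
    return t_one / t_p

def efficiency(t_one, t_p, n_p):
    """E = S / n_p: actual speedup relative to the ideal value n_p."""
    return speedup(t_one, t_p) / n_p

# Hypothetical timings: 512 s on one processor; on 64 processors,
# 8 s of ideal parallel work plus 2 s of communication overhead.
n_p = 64
t_one = 512.0
t_p = t_one / n_p + 2.0
S = speedup(t_one, t_p)          # 51.2
E = efficiency(t_one, t_p, n_p)  # 0.8
```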

One is also interested in how speedup, efficiency, and the parallel run time (T_p)
scale  with  problem  size,  and  with  the  number  of  processors  used.  The  objective  in 
using  parallel  computers  is  more  than  just  obtaining  a  good  speedup  on  a  particular 
problem  size  and  a  particular  number  of  processors.  For  parallel  CFD,  the  goals  are 
to  either  (1)  reduce  the  time  (the  indirect  cost  [23])  to  solve  problems  of  a  given 
complexity,  to  satisfy  the  need  for  rapid  turnaround  times  in  design  work,  or  (2) 
increase  the  complexity  of  problems  which  can  be  solved  in  a  fixed  amount  of  time. 
For  the  iteration-based  numerical  methods  studied  here,  there  are  two  considerations: 
the  cost  per  iteration,  and  the  number  of  iterations,  respectively,  computational  and 
numerical  factors.  The  total  run  time  is  the  product  of  the  two. 

Gustafson  [35]  has  presented  fixed-size  and  scaled-size  experiments  whose  results 
describe  how  the  cost  per  iteration  scales  on  a  particular  machine.  In  the  fixed- 
size  experiment,  the  efficiency  is  measured  for  a  fixed  problem  size  as  processors  are 
added.  The  hope  is  that  the  run  time  is  halved  when  the  number  of  processors  is 
doubled.  However,  the  run  time  obviously  cannot  be  reduced  indefinitely  by  adding 
more  processors  because  at  some  point  the  parallelism  runs  out — the  limit  to  the 
attainable  speedup  is  the  number  of  grid  points.  In  the  scaled-size  experiment,  the 
problem  size  is  increased  along  with  the  number  of  processors,  to  maintain  a  constant 
local  problem  size  for  each  of  the  parallel  processors.  Care  must  be  taken  to  make 
timings  on  a  per  iteration  basis  if  the  number  of  iterations  to  reach  the  end  of  the 
computation  increases  with  the  problem  size.  The  hope  in  such  an  experiment  is  that 
the program will maintain a certain high level of parallel efficiency E. The ability
to maintain E in the scaled-size experiment indicates that the additional processors
increased the speedup in a one-for-one trade.
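
The difference between the two experiments can be seen with a toy cost model. The sketch below is an assumption-laden illustration, not a measurement. The per-iteration time is taken as local computation plus perimeter communication plus a fixed front-end overhead, and the three constants are hypothetical. In this model the efficiency falls in the fixed-size experiment and stays constant in the scaled-size experiment.

```python
import math

T_CALC = 1.0e-6   # hypothetical seconds of computation per grid point
T_COMM = 5.0e-6   # hypothetical seconds per subgrid-boundary value exchanged
T_CTRL = 1.0e-4   # hypothetical front-end overhead per iteration

def time_per_iteration(n_points, n_proc):
    """Toy cost model: local work + perimeter communication + fixed overhead."""
    local = n_points / n_proc
    return T_CALC * local + T_COMM * 4.0 * math.sqrt(local) + T_CTRL

def efficiency(n_points, n_proc):
    """E = T_1 / (n_p * T_p), with T_1 the pure computation time on one processor."""
    return T_CALC * n_points / (n_proc * time_per_iteration(n_points, n_proc))

procs = [64, 256, 1024, 4096]

# Fixed-size experiment: the same 1024 x 1024 grid on more and more processors.
fixed = [efficiency(1024 * 1024, p) for p in procs]

# Scaled-size experiment: 128 x 128 points per processor, grid grows with p.
scaled = [efficiency(128 * 128 * p, p) for p in procs]
```

Timings are per iteration here, which sidesteps the growth in iteration count with problem size noted above.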

1.5     Pressure-Based  Multigrid  Methods 

Multigrid  methods  are  a  potential  route  to  both  computationally  and  numerically 
scalable programs. Their cost per iteration on parallel computers and convergence
rate are the subject of Chapters 4-5. For sufficiently smooth elliptic problems, the
convergence rate of multigrid methods is independent of the problem size, so their
operation count is O(N). In practice, good convergence rates are maintained as the
problem size increases for Navier-Stokes problems, also, provided suitable multigrid
components (the smoother, restriction and prolongation procedures) and multigrid
techniques are employed. The standard V-cycle full-multigrid (FMG) algorithm has
an almost optimal operation count, O(log^2 N) for Poisson equations, on parallel
computers. Provided the multigrid algorithm is implemented efficiently and that the cost
per iteration scales well with the problem size and the number of processors, the
multigrid approach seems to be a promising way to exploit the increased
computational capabilities that parallel computers offer.
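
The size-independent convergence rate can be observed on a model problem. The sketch below is a minimal linear (correction-scheme) V-cycle for the 1-d Poisson equation -u'' = f with zero Dirichlet values. It is not the nonlinear pressure-correction multigrid algorithm developed in this work. Weighted Jacobi relaxation, full-weighting restriction, and linear interpolation are assumed choices, and all function names are hypothetical.

```python
import numpy as np

def residual(u, f, h):
    """r = f - A u for the standard three-point discretization of -u''."""
    r = np.zeros_like(u)
    r[1:-1] = f[1:-1] - (2.0 * u[1:-1] - u[:-2] - u[2:]) / h**2
    return r

def smooth(u, f, h, sweeps, omega=2.0 / 3.0):
    """Weighted Jacobi relaxation; boundary values are left untouched."""
    for _ in range(sweeps):
        u[1:-1] += omega * 0.5 * h**2 * residual(u, f, h)[1:-1]
    return u

def restrict(r):
    """Full weighting from a fine grid to the next coarser grid."""
    rc = np.zeros((r.size - 1) // 2 + 1)
    rc[1:-1] = 0.25 * r[1:-2:2] + 0.5 * r[2:-1:2] + 0.25 * r[3::2]
    return rc

def prolong(ec):
    """Linear interpolation from a coarse grid to the next finer grid."""
    e = np.zeros(2 * (ec.size - 1) + 1)
    e[::2] = ec
    e[1::2] = 0.5 * (ec[:-1] + ec[1:])
    return e

def v_cycle(u, f, h, pre=2, post=2):
    """One V-cycle; the grid must have 2**k intervals."""
    if u.size == 3:                       # coarsest grid: one unknown
        u[1] = 0.5 * h**2 * f[1]
        return u
    u = smooth(u, f, h, pre)
    rc = restrict(residual(u, f, h))
    ec = v_cycle(np.zeros_like(rc), rc, 2.0 * h, pre, post)
    u += prolong(ec)                      # coarse-grid correction
    return smooth(u, f, h, post)

def convergence_factor(n, cycles=6):
    """Geometric-mean residual reduction per V-cycle on a grid of n intervals."""
    h = 1.0 / n
    x = np.linspace(0.0, 1.0, n + 1)
    f = np.pi**2 * np.sin(np.pi * x)
    u = np.zeros(n + 1)
    r0 = np.abs(residual(u, f, h)).max()
    for _ in range(cycles):
        u = v_cycle(u, f, h)
    return (np.abs(residual(u, f, h)).max() / r0) ** (1.0 / cycles)

factors = {n: convergence_factor(n) for n in (64, 256, 1024)}
```

The entries of `factors` should be of similar magnitude for all three grids, in contrast to a single-grid relaxation whose reduction factor approaches 1 as the grid is refined.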

The  pressure-based  methods  mentioned  previously  involve  the  solution  of  three 
systems  of  linear  algebraic  equations,  one  each  for  the  two  velocity  components 
and  one  for  the  pressure,  by  standard  iterative  methods  such  as  successive  line- 
underrelaxation  (SLUR).  Hence  they  inherit  the  convergence  rate  properties  of  these 
solvers,  i.e.  as  the  problem  size  grows  the  convergence  rate  deteriorates.  With  the 
single-grid  techniques,  therefore,  it  will  be  difficult  to  obtain  reasonable  turnaround 
times when the problem size is increased into the target range for parallel
computers. Multigrid techniques for accelerating the convergence of pressure-correction
methods should be pursued, and in fact they have been within the last five or so
years [70, 74, 80].

However,  there  are  still  many  unsettled  issues.  The  complexities  affecting  the 
convergence  rate  of  single-grid  calculations  carry  over  to  the  multigrid  framework 
and  are  compounded  there  by  the  coupling  between  the  evolving  solutions  on  multiple 
grid  levels,  and  by  the  particular  "grid-scheduling"  used. 

Linear  multigrid  methods  have  been  applied  to  accelerate  the  convergence  rate  for 
the  solution  of  the  system  of  pressure  or  pressure-correction  equations  [4,  22,  42,  64, 
94].  However,  the  overall  convergence  rate  does  not  significantly  improve  because  the 
velocity-pressure  coupling  is  not  addressed  [4,  22].  Therefore  the  multigrid  strategy 
should  be  applied  on  the  "outer  loop,"  with  the  role  of  the  iterative  relaxation  method 
played  by  the  numerical  methods  described  above,  e.g.  the  projection  method  or  the 
pressure-correction  method.  Thus,  the  generic  term  "smoother"  is  prescribed  because 
it  reflects  the  purpose  of  the  solution  of  the  coupled  system  of  equations  going  on 
inside  the  multigrid  cycle — to  smooth  the  residual  so  that  an  accurate  coarse-grid 
approximation  of  the  fine-grid  problem  is  possible.  It  is  not  true  that  a  good  solver, 
one  with  a  fast  convergence  rate  on  single-grid  computations,  is  necessarily  a  good 
smoother of the residual. It is therefore of interest to assess pressure-correction
methods as potential multigrid smoothers. See Shyy and Sun [74] for more information
on  the  staggered-grid  implementation  of  multigrid  methods,  and  some  encouraging 
results. 

Staggered  grids  require  special  techniques  [21,  74]  for  the  transfer  of  solutions  and 
residuals  between  grid  levels,  since  the  positions  of  the  variables  on  different  levels 
do not correspond. However, they alleviate the "checkerboard" pressure stability
problem [50], and since techniques have already been established [74], there is no
reason not to go this route, especially when Cartesian grids are used as in the present
work.

Vanka  [89]  has  proposed  a  new  numerical  method  as  a  smoother  for  multigrid 
computations,  one  which  has  inferior  convergence  properties  as  a  single-grid  method 
but apparently yields an effective multigrid method. A staggered-grid finite-volume
discretization is employed. In Vanka's smoother, the velocity components and
pressure of each control volume are updated simultaneously, so it is a coupled approach,
but the coupling between control volumes is not taken into account, so the
calculation of new velocities and pressures is explicit. This method is sometimes called
the  "locally-coupled  explicit"  or  "block-explicit"  pressure-based  method.  The  control 
volumes  are  visited  in  lexicographic  order  in  the  original  method  which  is  therefore 
aptly  called  BGS  (block  Gauss-Seidel).  Line-variants  have  been  developed  to  couple 
the  flow  variables  in  neighboring  control  volumes  along  lines  (see  [80,  87]). 

Linden et al. [50] gave a brief survey of multigrid methods for the steady-state
incompressible Navier-Stokes equations. They argue without analysis that BGS should
be preferred over the pressure-correction type methods since the strong local
coupling is likely to have better success smoothing the residual locally. On the other
hand, Sivaloganathan and Shaw [71, 70] have found good smoothing properties for
the pressure-correction approach, although the analysis was simplified considerably.
Sockol  [80]  has  compared  the  point  and  line-variants  of  BGS  with  the  pressure- 
correction  methods  on  serial  computers,  using  model  problems  with  different  physical 
characteristics.  SIMPLE  and  BGS  emerge  as  favorites  in  terms  of  robustness  with 
BGS  preferred  due  to  a  lower  cost  per  iteration.  This  preference  may  or  may  not 
carry  over  to  SIMD  parallel  computers  (see  Chapter  4  for  comparison).  Interesting 
applications  of  multigrid  methods  to  incompressible  Navier-Stokes  flow  problems  can 
be  found  in  [12,  28,  48,  54]. 

In terms of parallel implementations there are far fewer results, although this
field is rapidly growing. Simon [77] gives a recent cross-section of parallel CFD
results. Parallel multigrid methods, not only in CFD but as a general technique
for partial differential equations, have received much attention due to their desirable
$O(N)$ operation count on Poisson equations. However, it is apparently difficult to find
or design parallel computers with ideal communication networks for multigrid [13].
Consequently implementations have been pursued on a variety of machines to see
what performance can be obtained with the present generation of parallel machines,
and to identify and understand the basic issues. Dendy et al. [18] have recently
described a multigrid method on the CM-2. However, to accommodate the
data-parallel programming model they had to dimension their array data on every grid
level to the dimension extents of the finest grid array data. This approach is very
wasteful of storage. Consequently the size of problems which can be solved is greatly
reduced. Recently an improved release of the compiler has enabled the storage problem
to be circumvented with some programming diligence (see Chapter 5). The implementation
developed in this work is one of the first to take advantage of the new compiler feature.

In  addition  to  parallel  implementations  of  serial  multigrid  algorithms,  several 
novel  multigrid  methods  have  been  proposed  for  SIMD  computers  [25,  26,  33].  Some 
of the algorithms are intrinsically parallel [25, 26] or have increased parallelism
because  they  use  multiple  coarse  grids,  for  example  [33].  These  efforts  and  others 
have  been  recently  reviewed  [14,  53,  92].  Most  of  the  new  ideas  have  not  been 
developed  yet  for  solving  the  incompressible  Navier-Stokes  equations. 

One  of  the  most  prominent  concerns  addressed  in  the  literature  regarding  parallel 
implementations  of  serial  multigrid  methods  is  the  coarse  grids.  When  the  number 
of  grid  points  is  smaller  than  the  number  of  processors  the  parallelism  is  reduced 
to the number of grid points. This loss of parallelism may significantly affect the
parallel efficiency. One of the routes around the problem is to use multiple coarse
grids [59, 33, 79]. Another is to alter the grid-scheduling to avoid coarse grids. This
approach can lead to computationally scalable implementations [34, 49] but may
sacrifice the convergence rate. "Agglomeration" is an efficiency-increasing technique
used in MIMD multigrid programs which refers to the technique of duplicating the
coarse grid problem in each processor so that computation proceeds independently
(and redundantly). Such an approach can also be scalable [51]. However, most
attention so far has focused on parallel implementations of serial multigrid algorithms, in
particular on assessing the importance of the coarse-grid smoothing problem for
different machines and on developing techniques to minimize the impact on the parallel
efficiency.
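
The loss of parallelism on coarse grids can be made concrete with a small calculation. The sketch below assumes square grids coarsened by a factor of two in each direction and an idealized machine in which every processor is busy whenever there are at least as many grid points as processors. The function name and these assumptions are illustrative only.

```python
def level_utilization(fine_n, levels, processors):
    """Return (grid size, fraction of busy processors) for each multigrid level.

    fine_n     -- number of grid points per side on the finest level
    levels     -- number of grid levels, each coarser by a factor of two per side
    processors -- number of processors available
    """
    result = []
    n = fine_n
    for _ in range(levels):
        points = n * n
        # With fewer points than processors, parallelism drops to the point count.
        busy_fraction = min(points, processors) / processors
        result.append((n, busy_fraction))
        n //= 2
    return result


if __name__ == "__main__":
    for n, fraction in level_utilization(512, 6, 16384):
        print(f"{n:4d} x {n:<4d} grid: {fraction:.4f} of processors busy")
```

Under these assumptions every level below the one that exactly fills the machine uses a quarter as many processors as the level above it, which is why the coarse-grid smoothing cost receives so much attention.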

1.6     Description  of  the  Research 

The dissertation is organized as follows. Chapter 2 discusses the role of mass
conservation in the numerical consistency of the single-grid SIMPLE method for open
boundary problems, and explains the relevance of this issue to the convergence rate.
In Chapter 3 the single-grid pressure-correction method is implemented on the MP-1,
CM-2, and CM-5 computers and its performance is analyzed. High parallel
efficiencies are obtained at speeds and problem sizes well beyond the current performance
of such algorithms on traditional vector supercomputers. Chapter 4 develops a multigrid
numerical method for the purpose of accelerating the single-grid pressure-correction
method and maintaining the accelerated convergence property independent of the
problem size. The multigrid smoother, the intergrid transfer operators, and the
stabilization strategy for Navier-Stokes computations are discussed. Chapter 5 describes
the actual implementation of the multigrid algorithm on the CM-5, its convergence
rate, and its parallel run time and scalability. The convergence rate depends on the
flow problem and the coarse-grid discretization, among other factors. These factors
are considered in the context of the "full-multigrid" (FMG) starting procedure by
which the initial guess on the fine grid is obtained. The cost of the FMG
procedure is a concern for parallel computation [88], and this issue is also addressed. The
results indicate that the FMG procedure may influence the asymptotic convergence
rate and the stability of the multigrid iterations. Concluding remarks in each chapter
summarize the progress made and suggest avenues for further study.

Figure 1.1. Staggered-grid layout of dependent variables, for a small but complete
domain. Boundary values involved in the computation are shown. Representative u,
v, and pressure boundary control volumes are shaded.

[Figure 1.2 diagram: a front end (CM-2 and MP-1) or partition manager (CM-5) holds
the serial code, control code, and scalar data, and passes short blocks of parallel
code to a sequencer (CM-2), array control unit (MP-1), or multiple SPARC nodes
(CM-5). These issue individual instructions to the processing elements (P.E.s), with
the array data partitioned among the processor memories. The interprocessor
communication network is a hypercube plus "NEWS" (CM-2), a 3-stage crossbar plus
"X-Net" (MP-1), or a fat tree (CM-5).]

Figure 1.2. Layout of the MP-1, CM-2, and CM-5 SIMD computers.


CHAPTER  2 
PRESSURE-CORRECTION  METHODS 


2.1     Finite-Volume Discretization on Staggered Grids

The formulation of the numerical method used in this work begins with the
integration of the governing equations, Eqs. 1.1-1.3, over each of the control volumes in
the computational domain. Figure 1.1 shows a model computational domain with u, v,
and p (cell-centered) control volumes shaded. The continuity equation is integrated
over the p control volumes.

Consider the discretization of the u-momentum equation for the control volume
shown in Figure 2.1 whose dimensions are $\Delta x$ and $\Delta y$. The v control volumes are
treated in exactly the same way, except that they are rotated 90°. Integration of
Eq. 1.2 over the shaded region is interpreted as follows for each of the terms:


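$$\iint \frac{\partial \rho u}{\partial t}\,dx\,dy = \left(\frac{\partial \rho u}{\partial t}\right)_P \Delta x\,\Delta y, \tag{2.1}$$

$$\iint \frac{\partial \rho u^2}{\partial x}\,dx\,dy = \left(\rho u_e^2 - \rho u_w^2\right)\Delta y, \tag{2.2}$$

$$\iint \frac{\partial \rho u v}{\partial y}\,dx\,dy = \left(\rho u_n v_n - \rho u_s v_s\right)\Delta x, \tag{2.3}$$

$$\iint -\frac{\partial p}{\partial x}\,dx\,dy = -\left(p_e - p_w\right)\Delta y, \tag{2.4}$$

$$\iint \mu \frac{\partial^2 u}{\partial x^2}\,dx\,dy = \left(\mu \left.\frac{\partial u}{\partial x}\right|_e - \mu \left.\frac{\partial u}{\partial x}\right|_w\right)\Delta y, \tag{2.5}$$

$$\iint \mu \frac{\partial^2 u}{\partial y^2}\,dx\,dy = \left(\mu \left.\frac{\partial u}{\partial y}\right|_n - \mu \left.\frac{\partial u}{\partial y}\right|_s\right)\Delta x. \tag{2.6}$$

The lowercase subscripts e, w, n, s indicate evaluation on the control volume faces.
By convention and the mean-value theorem, these are at the midpoint of the faces.
The subscript P in Eq. 2.1 indicates evaluation at the center of the control volume.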


Because  of  the  staggered  grid,  the  required  pressure  values  in  Eq.  2.4  are  already 
located  on  the  u  control  volume  faces.  The  pressure-gradient  term  is  effectively  a 
second-order  central-difference  approximation.  With  colocated  grids,  however,  the 
control-volume  face  pressures  are  obtained  by  averaging  the  nearby  pressures.  This 
averaging  results  in  the  pressure  at  the  cell  center  dropping  out  of  the  expression 
for the pressure gradient. The central difference in Eq. 2.4 is effectively taken over
a distance $2\Delta x$ on colocated grids. Thus staggered Cartesian grids provide a more
accurate approximation of the pressure-gradient term since the difference stencil is
smaller.
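
The drop-out of the cell-center pressure on colocated grids can be seen in a short one-dimensional sketch. The function names are hypothetical and the grid is uniform. An oscillating "checkerboard" pressure field produces no gradient at all when differenced over $2\Delta x$, whereas the staggered difference over $\Delta x$ detects it.

```python
def colocated_gradient(p, dx):
    """Pressure gradient at interior nodes of a colocated grid.

    Face pressures are averages of neighboring nodal values, so the difference
    (p_e - p_w)/dx reduces to (p[i+1] - p[i-1]) / (2 dx); p[i] itself drops out.
    """
    return [(p[i + 1] - p[i - 1]) / (2.0 * dx) for i in range(1, len(p) - 1)]


def staggered_gradient(p, dx):
    """Pressure gradient at velocity locations of a staggered grid.

    Each velocity node sits between two pressure nodes, so the difference is
    taken directly over one mesh width.
    """
    return [(p[i + 1] - p[i]) / dx for i in range(len(p) - 1)]


if __name__ == "__main__":
    dx = 0.5
    checkerboard = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0]
    print(colocated_gradient(checkerboard, dx))   # every entry is zero
    print(staggered_gradient(checkerboard, dx))   # entries alternate in sign
```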

The next step is to approximate the terms which involve values at the control
volume faces. In Eq. 2.2, one of the $u_e$ and one of the $u_w$ are replaced by an average
of neighboring values,

$$\left(\rho u_e^2 - \rho u_w^2\right)\Delta y = \left(\rho\,\frac{u_E + u_P}{2}\,u_e - \rho\,\frac{u_P + u_W}{2}\,u_w\right)\Delta y, \tag{2.7}$$

and in Eq. 2.3, $v_n$ and $v_s$ are obtained by averaging nearby values,

$$\left(\rho u_n v_n - \rho u_s v_s\right)\Delta x = \left(\rho\,\bar{v}_n u_n - \rho\,\bar{v}_s u_s\right)\Delta x, \tag{2.8}$$

where the overbar marks an averaged value. The remaining face velocities in the
convection terms, $u_n$, $u_s$, $u_e$, and $u_w$, are expressed as a certain combination of
the nearby u values; which u values are involved and what weighting they receive is
prescribed by the convection scheme. Some popular recirculating flow convection
schemes are described in [73, 75].

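The control-volume face derivatives in the diffusion terms are evaluated by central
differences,

$$\left(\mu \left.\frac{\partial u}{\partial x}\right|_e - \mu \left.\frac{\partial u}{\partial x}\right|_w\right)\Delta y = \left(\mu\,\frac{u_E - u_P}{\Delta x} - \mu\,\frac{u_P - u_W}{\Delta x}\right)\Delta y, \tag{2.9}$$

$$\left(\mu \left.\frac{\partial u}{\partial y}\right|_n - \mu \left.\frac{\partial u}{\partial y}\right|_s\right)\Delta x = \left(\mu\,\frac{u_N - u_P}{\Delta y} - \mu\,\frac{u_P - u_S}{\Delta y}\right)\Delta x. \tag{2.10}$$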


The unsteady term in Eq. 2.1 is approximated by a backward Euler scheme. All the
terms are evaluated at the "new" time level, i.e. implicitly.

Thus, the discretized momentum equations for each control volume can be put
into the following general form,

$$a_P u_P = a_E u_E + a_W u_W + a_N u_N + a_S u_S + b, \tag{2.11}$$

where $b = (p_w - p_e)\Delta y + \rho u_P^n\,\Delta x\,\Delta y/\Delta t$, the superscript $n$ indicating the
previous time-step. The coefficients $a_N$, $a_S$, etc. consist of the terms which multiply
$u_N$, $u_S$, etc. in the discretized convection and diffusion terms.

The continuity equation is integrated over a pressure control volume,

$$\iint \left(\frac{\partial \rho u}{\partial x} + \frac{\partial \rho v}{\partial y}\right)dx\,dy = \rho\left(u_e - u_w\right)\Delta y + \rho\left(v_n - v_s\right)\Delta x = 0. \tag{2.12}$$

Again the staggered grid is an advantage because the normal velocity components on
each control volume face are already in position, so there is no need for interpolation.

2.2     The  SIMPLE  Method 

One SIMPLE iteration takes initial velocity and pressure fields $(u^*, v^*, p^*)$ and
computes new guesses $(u, v, p)$. The intermediate values are denoted with a tilde,
$(\tilde{u}, \tilde{v}, \tilde{p})$. In the algorithm below, $a_k^u(u^*, v^*)$, for example, means that the $a_k$
coefficient in the u-momentum equation depends on $u^*$ and $v^*$. The parameters $\nu_u$,
$\nu_v$, and $\nu_c$ are the numbers of "inner" iterations to be taken for the u, v, and
continuity equations, respectively. This notation will be clarified by the following
discussion. The inner iteration count is indicated by the superscript enclosed in
parentheses. Finally, $\omega_{uv}$ and $\omega_c$ are the relaxation factors for the momentum and
continuity equations.

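SIMPLE $(u^*, v^*, p^*; \nu_u, \nu_v, \nu_c, \omega_{uv}, \omega_c)$

Compute u coefficients $a_k^u(u^*, v^*)$ $(k = P, E, W, N, S)$ and source term $b^u(u^*, p^*)$
for each discrete u-momentum equation:

    $\frac{a_P^u}{\omega_{uv}}\,\tilde{u}_P = a_N^u \tilde{u}_N + a_S^u \tilde{u}_S + a_E^u \tilde{u}_E + a_W^u \tilde{u}_W + b^u + (1 - \omega_{uv})\frac{a_P^u}{\omega_{uv}}\,u_P^*$

Do $\nu_u$ iterations to obtain an approximate solution for $\tilde{u}$,
starting with $u^*$ as the initial guess:

    $u^{(n)} = G u^{(n-1)} + f^u$
    $\tilde{u} = u^{(n = \nu_u)}$

Compute v coefficients $a_k^v(\tilde{u}, v^*)$ $(k = P, E, W, N, S)$ and source term $b^v(v^*, p^*)$
for each discrete v-momentum equation:

    $\frac{a_P^v}{\omega_{uv}}\,\tilde{v}_P = a_N^v \tilde{v}_N + a_S^v \tilde{v}_S + a_E^v \tilde{v}_E + a_W^v \tilde{v}_W + b^v + (1 - \omega_{uv})\frac{a_P^v}{\omega_{uv}}\,v_P^*$

Do $\nu_v$ iterations to obtain an approximate solution for $\tilde{v}$,
starting with $v^*$ as the initial guess:

    $v^{(n)} = G v^{(n-1)} + f^v$
    $\tilde{v} = v^{(n = \nu_v)}$

Compute $p'$ coefficients $a_k^c$ $(k = P, E, W, N, S)$ and source term $b^c(\tilde{u}, \tilde{v})$
for each discrete $p'$ equation:

    $a_P^c p'_P = a_N^c p'_N + a_S^c p'_S + a_E^c p'_E + a_W^c p'_W + b^c$

Do $\nu_c$ iterations to obtain an approximate solution for $p'$,
starting with zero as the initial guess.

Correct $\tilde{u}$, $\tilde{v}$, and $p^*$ at every interior grid point:

    $u_P = \tilde{u}_P + (u')_P$
    $v_P = \tilde{v}_P + (v')_P$
    $p_P = p_P^* + \omega_c\,p'_P$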

The algorithm is not as complicated as it looks. The important point to note is
that the major tasks to be done are the computing of coefficients and the solving of
the systems of equations. The symbol $G$ indicates the iteration matrix of whatever
type of relaxation is used on these inner iterations (SLUR in this case), and $f$ is the
corresponding source term.

In  the  SIMPLE  pressure-correction  method  [61],  the  averages  in  Eq.  2.7  and  2.8 
are  lagged  in  order  to  linearize  the  resulting  algebraic  equations.  The  governing 
equations are solved sequentially. First, the u-momentum equation coefficients are
computed and an updated u field is computed by solving the system of linear
algebraic equations. The pressures in Eq. 2.4 are lagged. The v-momentum equation is
solved next to update v. The continuity equation, recast in terms of pressure
corrections, is then set up and solved. These pressure corrections are coupled to velocity
corrections.  Together  they  are  designed  to  correct  the  velocity  field  so  that  it  satisfies 
the  continuity  constraint,  while  simultaneously  correcting  the  pressure  field  so  that 
momentum  conservation  is  maintained. 

The  relationship  between  the  velocity  and  pressure  corrections  is  derived  from 
the  momentum  equation,  as  described  in  the  next  section.  The  resulting  system 
of  equations  is  fully  coupled,  as  one  might  expect  knowing  the  elliptic  nature  of 
pressure  in  incompressible  fluids,  and  is  therefore  expensive  to  solve.  However,  if  the 
resulting  system  of  pressure-correction  equations  were  solved  exactly,  the  divergence- 
free  constraint  and  the  momentum  equations  (with  old  values  of  u  and  v  present  in 
the  nonlinear  convection  terms)  would  be  satisfied.  This  approach  would  constitute 
an  implicit  method  of  time  integration  for  the  linearized  equations.  The  time-step 
size  would  have  to  be  limited  to  avoid  stability  problems  caused  by  the  linearization. 

To reduce the computational cost, the SIMPLE prescription is to use an
approximate relationship between the velocity and pressure corrections (hence the label
"semi-implicit"). Variations on the original SIMPLE approximation have shown
better convergence rates for simple flow problems, but in discretizations on curvilinear
grids and other problems with significant contributions from source terms, the
performance is no better than the original SIMPLE method (see the results in [4]).

The  goal  of  satisfying  the  divergence-free  constraint  can  still  be  attained,  if  the 
system  of  pressure-correction  equations  is  converged  to  strict  tolerances,  because  the 
discrete  continuity  equations  are  still  being  solved.  But  satisfaction  of  the  momentum 
equations  cannot  be  maintained  with  the  approximate  relationship.  Consequently  it 
is no longer desirable to solve the $p'$-system of equations to strict tolerances.
Iterations are necessary to find the right velocities and pressures which satisfy all
three equations. Furthermore, since the equation coefficients are changing from one
iteration to the next, it is pointless to solve the momentum equations to strict
tolerances. In practice, only a few iterations of a standard scheme such as successive
line-underrelaxation (SLUR) are performed.

The single "outer" iteration outlined above is repeated many times, with
underrelaxation to prevent the iterations from diverging. In this sense a two-level iterative
procedure is being employed. In the outer iterations, the momentum and
pressure-correction equations are iteratively updated based on the linearized coefficients and
sources, and inner iterations are applied to partially solve the systems of linear
algebraic equations.
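
The way underrelaxation enters the discrete equations can be sketched for a single control volume. The snippet assumes the common implicit form in which the diagonal coefficient is divided by the relaxation factor and the previous iterate is added to the source. The helper names are hypothetical and the neighbor contribution is lumped into one number.

```python
def underrelax(a_p, source, u_star, omega):
    """Return the relaxed diagonal coefficient and source term.

    The unrelaxed equation  a_p * u = neighbor_sum + source  becomes
    (a_p/omega) * u = neighbor_sum + source + (1 - omega) * (a_p/omega) * u_star.
    """
    a_p_relaxed = a_p / omega
    source_relaxed = source + (1.0 - omega) * a_p_relaxed * u_star
    return a_p_relaxed, source_relaxed


def point_update(a_p, neighbor_sum, source):
    """Solve one control-volume equation for its central value."""
    return (neighbor_sum + source) / a_p


if __name__ == "__main__":
    a_p, neighbor_sum, source = 4.0, 6.0, 4.0
    u_star, omega = 1.0, 0.5

    unrelaxed = point_update(a_p, neighbor_sum, source)
    a_rel, s_rel = underrelax(a_p, source, u_star, omega)
    relaxed = point_update(a_rel, neighbor_sum, s_rel)

    # The relaxed update moves only a fraction omega of the way from u_star.
    print(unrelaxed, relaxed, u_star + omega * (unrelaxed - u_star))
```

Because the extra source term vanishes when the new value equals the previous iterate, underrelaxation slows the outer iterations without changing the converged solution.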

The fact that only a few inner iterations are taken on each system of equations
suggests that the asymptotic convergence rate of the iterative solver, which is the usual
means of comparison between solvers, does not necessarily dictate the convergence
rate of the outer iterative process. Braaten and Shyy [4] have found that the
convergence rate of the outer iterations actually decreases when the pressure-correction
equation is solved to a much stricter tolerance than the momentum equations. They
concluded that the balance between the equations is important. Because $u$, $v$, and
$p'$ are segregated, the overall convergence rate is strongly dependent on the
particular flow problem, the grid distribution and quality, and the choice of relaxation
parameters.

In  contrast  to  projection  methods,  which  are  two-step  but  treat  the  convection 
terms  explicitly  (or  more  recently  by  solving  a  Riemann  problem  [2])  and  are  therefore 
restricted  from  taking  too  large  a  time-step,  the  pressure-correction  approach  is  fully 
implicit  with  no  time-step  limitation,  but  many  iterations  may  be  necessary.  The 
projection  methods  are  formalized  as  time-integration  techniques  for  semi-discrete 
equations.  SIMPLE  is  an  iterative  method  for  solving  the  discretized  Navier-Stokes 
system of coupled nonlinear algebraic equations. But the details given above should
make it clear that these techniques bear strong similarities: specifically, a single
SIMPLE iteration would be a projection method if the system of pressure-correction
equations were solved to strict tolerances at each iteration. It would be interesting to
do  some  numerical  comparisons  between  projection  methods  and  pressure-correction 
methods  to  further  clarify  the  similarity. 

2.3     Discrete  Formulation  of  the  Pressure-Correction  Equation 

The discrete pressure-correction equation is obtained from the discrete momentum
and continuity equations as follows. The velocity field which has been newly obtained
by solving the momentum equations was denoted by $(\tilde{u}, \tilde{v})$ earlier. The pressure field
after the momentum equations are solved still has the initial value $p^*$. So $\tilde{u}$, $\tilde{v}$, and
$p^*$ satisfy the u-momentum equation

$$a_P \tilde{u}_P = a_E \tilde{u}_E + a_W \tilde{u}_W + a_N \tilde{u}_N + a_S \tilde{u}_S + \left(p_w^* - p_e^*\right)\Delta y \tag{2.13}$$

and the corresponding v-momentum equation. The corrected (continuity-satisfying)
velocity field $(u, v)$ satisfies the u-momentum equation with the corrected pressure
field $p$,

$$a_P u_P = a_E u_E + a_W u_W + a_N u_N + a_S u_S + \left(p_w - p_e\right)\Delta y, \tag{2.14}$$

and likewise for the v-momentum equation. Additive corrections are assumed, i.e.

$$u = \tilde{u} + u' \tag{2.15}$$

$$v = \tilde{v} + v' \tag{2.16}$$

$$p = p^* + p'. \tag{2.17}$$

Subtracting Eq. 2.13 from Eq. 2.14 gives the desired relationship between pressure
and the u corrections,

$$a_P u'_P = \sum_{k=E,W,N,S} a_k u'_k + \left(p'_w - p'_e\right)\Delta y, \tag{2.18}$$

with a similar expression for the v corrections.

If  Eq.  2.18  is  used  as  is,  then  the  nearby  velocity  corrections  in  the  summation  need 
to  be  replaced  by  similar  expressions  involving  pressure-corrections.  This  requirement 
brings  in  more  velocity  corrections  and  more  pressure  corrections,  and  so  on,  leading 
to  an  equation  which  involves  the  pressure  corrections  at  every  grid  point.  The 
resulting  system  of  equations  would  be  expensive  to  solve.  Thus,  the  summation 
term  is  dropped  in  order  to  obtain  a  compact  expression  for  the  velocity  correction  in 
terms  of  pressure  corrections.  At  convergence,  the  pressure  corrections  (and  therefore 
the  velocity  corrections)  go  to  zero,  so  the  precise  form  of  the  approximate  pressure- 
velocity  correction  relationship  does  not  figure  in  the  final  converged  solution. 

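The discrete form of the pressure-correction equation follows by first substituting
the simplified version of Eq. 2.18 into Eq. 2.15,

$$u_P = \tilde{u}_P + u'_P = \tilde{u}_P + \frac{\Delta y}{a_P}\left(p'_w - p'_e\right), \tag{2.19}$$

and then substituting this into the continuity equation, Eq. 2.12 (with an analogous
formula for $v_P$). The result is

$$\frac{\rho\,\Delta y^2}{a_P(u_e)}\left(p'_P - p'_E\right) - \frac{\rho\,\Delta y^2}{a_P(u_w)}\left(p'_W - p'_P\right) + \frac{\rho\,\Delta x^2}{a_P(v_n)}\left(p'_P - p'_N\right) - \frac{\rho\,\Delta x^2}{a_P(v_s)}\left(p'_S - p'_P\right) = b, \tag{2.20}$$

where the source term $b$ is

$$b = \rho \tilde{u}_w \Delta y - \rho \tilde{u}_e \Delta y + \rho \tilde{v}_s \Delta x - \rho \tilde{v}_n \Delta x. \tag{2.21}$$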

Recall that Eq. 2.20 and Eq. 2.21 are written for the pressure control volumes, so that
there is some interpretation required. The term $a_P(u_e)$ in Eq. 2.20 is the appropriate
$a_P$ for the discretized u-momentum equation, Eq. 2.13. In other words, $u_P$ in Eq. 2.13
(or $v_P$ in the corresponding v-momentum equation) is actually $u_e$, $u_w$, $v_n$, or $v_s$ in
Eq. 2.20 and 2.21, relative to the pressure control volumes on the staggered grid.
Eq. 2.20 can be rearranged into the same general form as Eq. 2.11. From Eq. 2.21, it is
apparent that the right-hand side term is the net mass flux entering the control
volume, which should be zero in incompressible flow.
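
Rearranging Eq. 2.20 into the general form of Eq. 2.11 gives one link coefficient per face and a diagonal coefficient equal to their sum. The following sketch evaluates these for a single pressure control volume. The function and argument names are hypothetical; the arguments `ap_ue`, `ap_uw`, `ap_vn`, and `ap_vs` stand for the momentum-equation $a_P$ values at the four faces.

```python
def pressure_correction_coefficients(rho, dx, dy, ap_ue, ap_uw, ap_vn, ap_vs):
    """Link coefficients of the p' equation for one pressure control volume.

    Follows Eq. 2.20 rearranged as
    a_P p'_P = a_E p'_E + a_W p'_W + a_N p'_N + a_S p'_S + b.
    """
    a_e = rho * dy * dy / ap_ue
    a_w = rho * dy * dy / ap_uw
    a_n = rho * dx * dx / ap_vn
    a_s = rho * dx * dx / ap_vs
    return {"E": a_e, "W": a_w, "N": a_n, "S": a_s, "P": a_e + a_w + a_n + a_s}


def mass_source(rho, dx, dy, u_e, u_w, v_n, v_s):
    """Net mass flux entering the control volume (Eq. 2.21), from tilde velocities."""
    return rho * (u_w - u_e) * dy + rho * (v_s - v_n) * dx


if __name__ == "__main__":
    coeffs = pressure_correction_coefficients(1.0, 0.5, 0.5, 2.0, 2.0, 2.0, 2.0)
    b = mass_source(1.0, 0.5, 0.5, u_e=1.0, u_w=3.0, v_n=0.0, v_s=0.0)
    print(coeffs, b)
```

A positive source means more mass enters than leaves, so the resulting pressure correction must act to increase the outflow from the cell.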

In the formulation of the pressure-correction equation for boundary control
volumes, one makes use of the fact that the normal velocity components on the
boundaries are known from either Dirichlet or Neumann boundary conditions, so no velocity
correction is required there. Consequently, the formulation of Eq. 2.20 for boundary
control volumes does not require any prescription of boundary $p'$ values [60] when
velocity boundary conditions are prescribed. Without the summation from Eq. 2.18,
it is apparent that a zero velocity correction for the outflow boundary u-velocity
component is obtained when $p'_w = p'_e$; in effect, a Neumann boundary condition on
pressure is implied. This boundary condition is appropriate for an incompressible
fluid because it is physically consistent with the governing equations in which only
the pressure gradient appears. There is a unique pressure gradient but the level is
adjustable by any constant amount. If it happens that there is a pressure specified
on the boundary, for example by Eq. 1.4, then the correction there will be zero,
providing a boundary condition for Eq. 2.20. Thus, it seems that there are no concerns
over the specification of boundary conditions for the $p'$ equations.

2.4     Well-Posedness  of  the  Pressure-Correction  Equation 

2.4.1     Analysis

To better understand the characteristics of the pressure-correction step in the
SIMPLE procedure, consider a model 3 x 3 computational domain, so that 9 algebraic
equations for the pressure corrections are obtained. Number the control volumes as
shown in Figure 2.3. Then the system of $p'$ equations can be written as a 9 x 9
matrix equation (Eq. 2.22), where the superscript on each coefficient designates the
cell location and the subscript designates the coefficient linking the point in
question, P, and the neighboring node. The right-hand side velocities are understood
to be tilde quantities as in Eq. 2.21.

In  finite-volume  discretizations,  fluxes  are  estimated  at  the  control  volume  faces 
which  are  common  to  adjacent  control  volumes,  so  if  the  governing  equations  are 
cast  in  conservation  law  form,  as  they  are  here,  the  discrete  efflux  of  any  quantity 
out  of  one  control  volume  is  guaranteed  to  be  identical  to  the  influx  into  its  neighbor. 
There is no possibility of internal sources or sinks. In fact this is what makes
finite-volume discretizations preferable to finite-difference discretizations. The following
relationships, using control volume 5 in Figure 2.3 as an example, follow from Eq. 2.20
and the internal consistency of finite-volume discretizations:

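$$a_P^5 = a_E^5 + a_W^5 + a_N^5 + a_S^5, \tag{2.23}$$

$$a_W^5 = a_E^4, \quad a_E^5 = a_W^6, \quad a_N^5 = a_S^8, \quad a_S^5 = a_N^2, \tag{2.24}$$

$$u_w^5 = u_e^4, \quad u_e^5 = u_w^6, \quad v_n^5 = v_s^8, \quad v_s^5 = v_n^2. \tag{2.25}$$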

Eq. 2.23 states that the coefficient matrix is pentadiagonal and diagonally dominant
for the interior control volumes. Furthermore, when the natural boundary condition
(zero velocity correction) is applied, the appropriate term in Eq. 2.20 for the boundary
under consideration does not appear, and therefore the pressure-correction equations
for the boundary control volumes also satisfy Eq. 2.23. If a pressure boundary
condition is applied so that the corresponding pressure correction is zero, then one would
set $p'_E = 0$ in Eq. 2.20, for example, which would give $a_W + a_N + a_S < a_P$. Thus,
either way, the entire coefficient matrix in Eq. 2.22 is diagonally dominant. However,
with the natural prescription for boundary treatment, no diagonal term exceeds the
sum of its off-diagonal terms.

Thus,  the  system  of  equations  Eq.  2.22  is  linearly  dependent  with  the  natural 
(velocity)  boundary  conditions,  which  can  be  verified  by  adding  the  9  equations 
above.  Because  of  Eq.  2.23  and  Eq.  2.24  all  terms  on  the  left-hand  side  of  Eq.  2.22 
identically  cancel  one  another.  At  all  interior  control  volume  interfaces,  the  right- 
hand  side  terms  identically  cancel  due  to  Eq.  2.25,  and  the  remaining  source  terms 
are  simply  the  boundary  mass  fluxes.  This  cancellation  is  equivalent  to  a  discrete 
statement of the divergence theorem

$$\int_\Omega \nabla \cdot \mathbf{u}\,d\Omega = \int_{\partial\Omega} \mathbf{u} \cdot \mathbf{n}\,d(\partial\Omega), \tag{2.26}$$

where $\Omega$ is the domain under consideration and $\mathbf{n}$ is the unit vector in the direction
normal to its boundary $\partial\Omega$.

Due  to  the  linear  dependence  of  the  left-hand  side  of  Eq.  2.22,  the  boundary  mass 
fluxes  must  also  sum  to  zero  in  order  for  the  system  of  equations  to  be  consistent. 
No  solution  exists  if  the  linearly  dependent  system  of  equations  is  inconsistent.  The 
situation  can  be  likened  to  a  steady-state  heat  conduction  problem  with  source  terms 
and  adiabatic  boundaries.  Clearly,  a  steady-state  solution  only  exists  if  the  sum  of 
the  source  terms  is  zero.  If  there  is  a  net  heat  source,  then  the  temperature  inside 
the  domain  will  simply  rise  without  bound  if  an  iterative  solution  strategy  (quasi 
time-marching)  is  used.  Likewise,  the  net  mass  source  in  flow  problems  with  open 
boundaries  must  sum  to  zero  for  the  pressure-correction  equation  to  have  a  solution. 
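
The linear dependence and the resulting consistency requirement can be checked numerically. The following is a minimal Python sketch, not the solver used in this work. It assumes unit link coefficients on the 3 x 3 model domain, and the function names and source vectors are illustrative assumptions. It builds a matrix that satisfies Eq. 2.23 in every control volume and tests whether a given set of local mass sources admits a solution.

```python
import numpy as np

def neumann_pressure_matrix(nx, ny):
    """Five-point pressure-correction matrix with zero-velocity-correction
    boundaries. Unit link coefficients are an assumption for illustration."""
    n = nx * ny
    A = np.zeros((n, n))
    for j in range(ny):
        for i in range(nx):
            p = j * nx + i
            for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                ii, jj = i + di, j + dj
                if 0 <= ii < nx and 0 <= jj < ny:
                    A[p, jj * nx + ii] -= 1.0  # neighbour coefficient
                    A[p, p] += 1.0             # diagonal = sum of neighbours
    return A

def is_consistent(A, b, tol=1e-10):
    """A x = b has a solution only if the least-squares residual vanishes."""
    x = np.linalg.lstsq(A, b, rcond=None)[0]
    return bool(np.linalg.norm(A @ x - b) < tol)

A = neumann_pressure_matrix(3, 3)
rank = np.linalg.matrix_rank(A)    # one less than the number of unknowns
column_sums = A.sum(axis=0)        # adding all 9 equations cancels the left-hand side

# Mass enters through the left column; nothing leaves (net source = 3).
b_inflow_only = np.array([1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0])
# Same inflow, matched by an equal outflow through the right column.
b_balanced = np.array([1.0, 0.0, -1.0, 1.0, 0.0, -1.0, 1.0, 0.0, -1.0])

print(rank, is_consistent(A, b_inflow_only), is_consistent(A, b_balanced))
```

In this sketch only the balanced source vector passes the consistency check, which mirrors the requirement that the boundary mass fluxes sum to zero.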

In  other  words,  global  mass  conservation  is  required  in  discrete  form  in  order  for  a 
solution to exist. The interesting point to note is that during the course of SIMPLE
iterations, when the pressure-correction equation is solved, the velocity field does
not usually conserve mass globally in flow problems with open boundaries, unless
an explicit measure is taken to enforce global mass conservation. The purpose of solving
the  pressure-correction  equations  is  to  drive  the  local  mass  sources  to  zero  by  suitable 
velocity  corrections.  But  the  pressure-correction  equations  which  are  supposed  to 
accomplish  this  purpose  do  not  have  a  solution  unless  the  net  mass  source  is  already 
zero.  For  domains  with  closed  boundaries,  global  mass  conservation  is  obviously  not 
an  issue. 

Furthermore,  this  problem  does  not  only  show  up  when  the  initial  guess  is  bad. 
In  the  backward-facing  step  flow  discussed  below,  the  initial  guess  is  zero  everywhere 
except  for  inflow,  which  obviously  is  the  worst  case  as  far  as  a  net  mass  source  is 
concerned  (all  inflow  and  no  outflow).  But  even  if  one  starts  with  a  mass-conserving 
initial guess, during the course of iterations the outflow velocity boundary condition
which is necessary to solve the momentum equations will reset the outflow so that
the global mass-conservation constraint is violated.

2.4.2     Verification  by  Numerical  Experiments 

Support  for  the  preceding  discussion  is  provided  by  numerical  simulation  of  two 
model problems, a lid-driven cavity flow and a backward-facing step flow. The configurations
are shown along with other relevant data in Figure 2.2.

Figure  2.4  shows  the  outer-loop  convergence  paths  for  the  lid-driven  cavity  flow 
and the backward-facing step flow, both at Re = 100. The quantities plotted in
Figure 2.4 are the log10 of the global residuals for each governing equation obtained
by  summing  up  the  local  residuals,  each  of  which  is  obtained  by  subtracting  the 
left-hand  side  of  the  discretized  equations  from  the  right-hand  side.  For  the  cavity 
flow  there  are  no  mass  fluxes  across  the  boundary  so,  as  mentioned  earlier,  the  global 
mass  conservation  condition  is  always  satisfied  when  the  algorithm  reaches  the  point 
of  solving  the  system  of  p'-equations.  The  residuals  have  dropped  to  10~'  after  150 
iterations,  which  is  very  rapid  convergence,  indicating  that  good  pressure  and  velocity 
corrections  are  being  obtained. 

In  the  backward-facing  step  flow,  however,  the  flowfield  is  very  slow  to  develop 
because no global mass conservation measure is enforced. During the course of iterations,
the mass flux into the domain from the left is not matched by an equal flux
through  the  outflow  boundary,  and  consequently  the  system  of  pressure-correction 
equations  which  is  supposed  to  produce  a  continuity-satisfying  velocity  field  does  not 
have  a  solution.  Correspondingly  one  observes  that  the  outer-loop  convergence  rate 
is  about  10  times  worse  than  for  cavity  flow. 

Also,  note  that  the  momentum  convergence  path  of  the  backward-facing  step  flow 
in Figure 2.4 tends to follow the continuity equation, indicating that the pressure and
velocity fields are strongly coupled. The present flow problem bears some similarity to
a fully-developed channel flow, in which the streamwise pressure gradient and cross-stream
viscous diffusion are balanced, so the observation that pressure and velocity
are  strongly  coupled  is  intuitively  correct.  Thus,  the  convergence  path  is  controlled 
by  the  development  of  the  pressure  field.  The  slow  convergence  rate  problem  is  due 
to  the  inconsistency  of  the  system  of  pressure-correction  equations. 

The inner-loop convergence path (the SLUR iterations) for the p'-system of equations
must be examined to determine the manner in which the inner-loop inconsistency
leads to poor outer-loop convergence rates. Table 2.1 shows leading eigenvalues
for successive line-underrelaxation iteration matrices of the p'-system of equations at
an  intermediate  iteration  for  which  the  outer-loop  residuals  had  dropped  to  approx- 
imately 10~^. 


Largest 3 eigenvalues    Cavity Flow    Back-Step Flow
λ1                       1.0            1.0
λ2                       0.956          0.996
λ3                       0.951          0.984

Table 2.1. Largest eigenvalues of iteration matrices during an intermediate iteration,
applying the successive line-underrelaxation iteration scheme to the p'-system of
equations.


In both model problems the spectral radius is 1.0 because the p'-system of equations
is linearly dependent. The next largest eigenvalue is smaller in the cavity flow
computation than in the step flow computation, which means a faster asymptotic convergence
rate. However, the difference between 0.996 and 0.956 is not large enough
to  produce  the  significant  difference  observed  in  the  outer  convergence  path. 

Figure 2.5 shows the inner-loop residuals of the SLUR procedure during an intermediate
iteration. The two momentum equations are well-conditioned and converge
to a solution within 4 iterations. In Figure 2.5 for the cavity flow case, the p'-equation
converges to zero, although this happens at a slower rate than the two momentum
equations because of the diffusive nature of the equation. In Figure 2.5 for the back-step
flow, the inner-loop residual stalls at a nonzero value, which is in fact the
initial level of inconsistency in the system of equations, i.e. the global mass deficit.
Given that the system of p'-equations which is being solved does not satisfy the
global  continuity  constraint,  however,  the  significance  or  utility  of  the  p'-field  that 
has  been  obtained  is  unknown. 

In practice, the overall procedure may still be able to lead to a converged solution,
as in the present case. It appears that the outflow extrapolation procedure,
a  zero-gradient  treatment  utilized  here,  can  help  induce  the  overall  computation  to 
converge  to  the  right  solution  [72].  Obviously,  such  a  lack  of  satisfaction  of  global 
mass  conservation  is  not  desirable  in  view  of  the  slow  convergence  rate. 

Further  study  suggests  that  the  iterative  solution  to  the  inconsistent  system  of 
p'-equations  converges  on  a  unique  pressure  gradient,  i.e.  the  difference  between  p' 
values  at  any  two  points  tends  to  a  constant  value,  even  though  the  p'-field  does  not 
in  general  satisfy  any  of  the  equations  in  the  system.  This  relationship  is  shown  in 
Figure  2.6,  in  which  the  convergence  of  the  difference  in  p'  between  the  lower-left  and 
upper-right  locations  in  the  domain  of  the  cavity  and  backward-facing  step  flows  is 
plotted.  Also  shown  is  the  value  of  p'  at  the  lower-left  corner  of  the  domain.  For  the 
cavity  flow,  there  is  a  solution  to  the  system  of  p'-equations,  and  it  is  obtained  by 
the  SLUR  technique  in  about  10  iterations.  Thus  all  the  pressure  corrections  and  the 
differences between them tend towards constant values. In the backward-facing step
flow,  however,  the  individual  pressure  corrections  increase  linearly  with  the  number 
of  iterations,  symptomatic  of  the  inconsistency  in  the  system  of  equations.  The 
differences between p' values approach a constant, however. The rate at which this
unique pressure-gradient field is obtained depends on the eigenvalues of the iteration
matrix.
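
The drift of the individual p' values at fixed differences can be reproduced on a small model system. The sketch below is illustrative only. It substitutes a point under-relaxed Jacobi iteration for the SLUR scheme, and the 9 x 3 grid, unit coefficients, relaxation factor of 0.8, and source distribution are all assumptions chosen for brevity.

```python
import numpy as np

def neumann_1d(n):
    """1-D second-difference matrix with zero-gradient (Neumann) ends."""
    T = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
    T[0, 0] = T[-1, -1] = 1.0
    return T

nx, ny = 9, 3
# 2-D five-point matrix; every row satisfies a_P = sum of neighbour coefficients.
A = np.kron(np.eye(ny), neumann_1d(nx)) + np.kron(neumann_1d(ny), np.eye(nx))

b = np.zeros(nx * ny)
b[0::nx] = 1.0                  # mass source in the inflow column only (net source = 3)

omega = 0.8                     # under-relaxation factor (assumed)
diag = np.diag(A).copy()
p = np.zeros(nx * ny)
corner, difference = [], []
for _ in range(2000):
    p = p + omega * (b - A @ p) / diag   # one under-relaxed Jacobi sweep
    corner.append(p[0])                  # value at the lower-left control volume
    difference.append(p[0] - p[-1])      # lower-left minus upper-right

drift_per_sweep = corner[-1] - corner[-2]
print(drift_per_sweep, difference[-1])
```

In this model the corner value keeps growing by a fixed increment per sweep, while the difference between the two corners settles to a constant, which is the behavior described for the backward-facing step flow.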

To resolve the inconsistency problem in the p'-system of equations and thereby
improve the outer-loop convergence rate in the backward-facing step flow, global mass
conservation  has  been  explicitly  enforced  during  the  sequential  solution  procedure. 
The  procedure  used  is  to  compute  the  global  mass  deficit  and  then  add  a  constant 
value to the outflow boundary u-velocities to restore global mass conservation. Alternatively,
corrections can be applied at every streamwise location by considering
control  volumes  whose  boundaries  are  the  inflow  plane,  the  top  and  bottom  walls 
of  the  channel,  and  the  i=constant  line  at  the  specified  streamwise  location.  The 
artificially-imposed convection has the effect of speeding up the development of the
pressure field, whose normal development is diffusion-dominated. It is interesting to
note that this physically-motivated approach is in essence an acceleration of convergence
of the line-iterative method via the technique called additive correction [45, 69].
The strategy is to adjust the residual on the current line to zero by adding a constant
to all the unknowns in the line. This procedure is done for every line, for every
iteration,  and  generally  produces  improvement  in  the  SLUR  solution  of  a  system  of 
equations.  Kelkar  and  Patankar  [45]  have  gone  one  step  further  by  applying  additive 
corrections  like  an  injection  step  of  a  multigrid  scheme,  a  so-called  block  correction 
technique.  This  technique  is  exploited  to  its  fullest  by  Hutchinson  and  Raithby  [42]. 
Given  a  fine-grid  solution  and  a  coarse  grid,  discretized  equations  for  the  correction 
quantities  on  the  coarse  grid  are  obtained  by  summing  the  equations  for  each  of  the 
fine-grid  cells  within  a  given  coarse  grid  cell.  A  solution  is  then  obtained  (by  direct 
methods  in  [45])  which  satisfies  conservation  of  mass  and  momentum.  The  corrections 
are then distributed uniformly to the fine grid cells which make up the coarse grid
cell, and the iterative solution on the fine grid is resumed. However, experience has
shown that the net effect of such a treatment for complex flow problems is limited.
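
The global form of the correction, in which the mass deficit is computed and a constant is added to the outflow u-velocities, can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the implementation used in this work: constant density, inflow and outflow through vertical boundaries only, and a hypothetical function name `restore_global_mass`. The face sizes and velocity profiles are made-up sample data.

```python
import numpy as np

def restore_global_mass(u_in, dy_in, u_out, dy_out, rho=1.0):
    """Return outflow u-velocities shifted by a constant so that the
    mass flux leaving the domain equals the mass flux entering it."""
    mass_in = rho * np.sum(u_in * dy_in)
    mass_out = rho * np.sum(u_out * dy_out)
    deficit = mass_in - mass_out                    # global mass deficit
    correction = deficit / (rho * np.sum(dy_out))   # uniform velocity increment
    return u_out + correction

# Illustrative data: 4 inflow faces over the step opening h = 0.5,
# 8 outflow faces over the full channel height H = 1.
dy_in = np.full(4, 0.125)
dy_out = np.full(8, 0.125)
y_c = (np.arange(4) + 0.5) / 4.0        # normalised inflow face centres
u_in = 6.0 * y_c * (1.0 - y_c)          # parabolic inflow profile
u_out = np.zeros(8)                     # worst case: no outflow yet

u_out_new = restore_global_mass(u_in, dy_in, u_out, dy_out)
print(np.sum(u_in * dy_in), np.sum(u_out_new * dy_out))
```

After the correction the two printed mass fluxes agree, so the net mass source seen by the pressure-correction equations is zero.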

Figure  2.7  illustrates  the  improved  convergence  rate  of  the  continuity  equation  for 
the  inner  and  outer  loops,  in  the  backward-facing  step  flow,  when  conservation  of  mass 
is  explicitly  enforced.  The  inner-loop  data  is  from  the  10th  outer-loop  iteration.  In 
Figure  2.7,  the  cavity  flow  convergence  path  is  also  shown  to  facilitate  the  comparison. 
For  the  back-step,  the  overall  convergence  rate  is  improved  by  an  order  of  magnitude, 
becoming  slightly  faster  than  the  cavity  flow  case.  This  result  reflects  the  improved 
inner-loop  performance,  also  shown  in  Figure  2.7.  The  improved  performance  for  the 
pressure-correction  equation  comes  at  the  expense  of  a  slightly  slower  convergence 
rate  for  the  momentum  equations,  because  of  the  nonlinear  convection  term. 

In short, it has been shown that a consistency condition, which is physically the requirement
of global mass conservation, is critical for meaningful pressure-corrections
to  be  guaranteed.  Given  natural  (velocity)  boundary  conditions,  which  lead  to  a 
linearly  dependent  system  of  pressure-correction  equations,  satisfaction  of  the  global 
continuity  constraint  is  the  only  way  that  a  solution  can  exist,  and  therefore  the  only 
way  that  the  inner-loop  residuals  can  be  driven  to  zero.  For  the  model  backward- 
facing  step  flow  in  a  channel  with  length  L  =  4  and  a  21  x  9  mesh,  the  mass- 
conservation  constraint  is  enforced  globally  or  at  every  streamwise  location  by  an 
additive-correction technique. This technique produces a 10-fold increase in the convergence
rate. Physically, modifying the u velocities has the same effect as adding
a convection term to the Poisson equation for the p'-field, which otherwise develops
very slowly. A coarse grid size was used to demonstrate the need to enforce global
mass  conservation.  On  a  finer  grid,  this  issue  becomes  more  critical.  In  the  next 
section,  the  solution  accuracy  aspects  related  to  mass  conservation  will  be  addressed, 
and the computations will be conducted with more adequate grid resolution.

2.5     Numerical Treatment of Outflow Boundaries

Continuing with the theme of well-posedness, the next numerical issue to be discussed
is the choice of outflow boundary location. If fluid flows into the domain at
a  boundary  where  extrapolation  is  applied,  then,  traditionally,  the  problem  is  not 
considered  to  be  well-posed,  because  the  information  which  is  being  transported  into 
the  domain  does  not  participate  in  the  solution  to  the  problem  [60].  Numerically, 
however, accurate solutions can be obtained using first-order extrapolation for the velocity
components on a boundary where inflow is occurring [72]. Here open boundary
treatment  for  both  steady  and  time-dependent  flow  problems  is  investigated  further. 

Figures 2.8 and 2.9 present streamfunction contours for a time-dependent flow
problem, impulsively started backward-facing step flow, using central-differencing
for the convection terms and first-order backward-differencing in time. A parabolic
inflow velocity profile is specified, while outflow boundary velocities are obtained by
first-order extrapolation. The Reynolds number based on the average inflow velocity
$U_{avg}$ and the channel height H is 800. The expansion ratio H/h is 2 as in the model
problem described in Figure 2.2. Time-accurate simulations were performed for two
channel configurations, one with length L = 8 (81 x 41 mesh) and the other with
length  L  =  16  (161  x  41  mesh).  This  flow  problem  has  been  the  subject  of  some 
recent  investigations  focusing  on  open  boundary  conditions  [30,  31]. 

For  each  time  step,  the  SIMPLE  algorithm  is  used  to  iteratively  converge  on  a 
solution  to  the  unsteady  form  of  the  governing  equations,  explicitly  enforcing  global 
conservation  of  mass  during  the  course  of  iterations.  In  the  present  study,  convergence 
was  declared  for  a  given  time  step  when  the  global  residuals  had  been  reduced  below 
10"''.    The  time-step  size  was  twice  the  viscous  time  scale  in  the  y-direction,  i.e. 
$\Delta t = 2\Delta y^{2}/\nu$. Thus a fluid particle entering the domain at the average velocity u =
1 travels 2 units downstream during a time-step.

Figure  2.8  shows  the  formation  of  alternate  bottom/top  wall  recirculation  regions 
during startup which gradually become thinner and elongated as they drift downstream.
For the L = 16 simulation (Figure 2.8), the transient flowfield has as many
as  four  separation  bubbles  at  T  =  32,  the  latter  two  of  which  are  eventually  washed 
out  of  the  domain.  In  the  L  =  8  simulation  (Figure  2.9)  the  streamfunction  plots  are 
at  times  corresponding  to  those  shown  in  Figure  2.8.  Note  that  between  T  =  11  and 
T  =  32,  a  secondary  bottom  wall  recirculation  zone  forms  and  drifts  downstream, 
exiting  without  reflection  through  the  downstream  boundary.  The  time  evolution  of 
the  flowfield  for  the  L  =  8  and  L  =  16  simulations  is  virtually  identical. 

As  can  be  observed,  the  facts  that  a  shorter  channel  length  was  used  in  Figure  2.9 
and  that  a  recirculating  cell  may  go  through  the  open  boundary  do  not  affect  the 
solutions.  Figure  2.10  compares  the  computed  time  histories  of  the  bottom  wall 
reattachment  and  top  wall  separation  points  between  the  two  computations.  The 
L = 8 and L = 16 curves overlap perfectly. The steady-state solutions for
both the L = 8 and L = 16 channel configurations are also shown in Figures 2.9
and 2.8, respectively. Although the outflow boundary cuts the top wall separation
bubble  approximately  in  half,  there  is  no  apparent  difference  between  the  computed 
streamfunction  contours  for  0  <  x  <  8.  Furthermore,  the  convergence  rate  is  not 
affected  by  the  choice  of  outflow  boundary  location. 

Figure 2.11 compares the steady-state u and v velocity profiles at x = 7 between
the two computations. The accuracy of the computed results is assessed by
comparison  with  an  FEM  numerical  solution  reported  by  Gartling  [27].  Figure  2.11 
establishes  quantitatively  that  the  two  simulations  differ  negligibly  over  0  <  x  <  8 
(the  V  profile  differs  on  the  order  of  10"^)  The  velocity  scale  for  the  problem  is  1. 
Neither v profile agrees perfectly with the solution obtained by Gartling, which may
be  attributed  to  the  need  for  conducting  further  grid  refinement  studies  in  the  present 
work  and/or  Gartling's  work. 

Evidently the location of the open boundary is not critical to obtaining a converged
solution. This observation indicates that the downstream information is completely
accounted for by the continuity equation. The correct pressure field can develop
because the system of p'-equations requires only the boundary mass flux specification.
If the global continuity constraint is satisfied, the pressure-correction equation
is consistent regardless of whether there is inflow or outflow at the boundary where
extrapolation is applied. The numerical well-posedness of the open boundary computation
results in virtually identical flowfield development for the time-dependent
L  =  8  and  L  =  16  simulations  as  well  as  steady-state  solutions  which  agree  with  each 
other  and  follow  closely  Gartling's  benchmark  data  [27]. 

2.6     Concluding  Remarks 

In order for the SIMPLE pressure-correction method to be a well-posed numerical
procedure for open boundary problems, explicit steps must be taken to ensure
the numerical consistency of the pressure-correction system of equations during the
course of iterations. For the discrete problem with the natural boundary treatment
for pressure, i.e. normal velocity specified at all boundaries, global mass conservation
is the solvability constraint which must be satisfied in order that the system of
p'-equations is consistent. Without a globally mass-conserving procedure enforced
during each iterative step, the utility of the pressure-corrections obtained at each iteration
cannot be guaranteed. Overall convergence may still occur, albeit very slowly.
In this regard, the poor outer-loop convergence behavior simply reflects the (poor)
convergence rate of the inner-loop iterations of the SLUR technique. In general, the
inner-loop residual is fixed on the value of the initial level of inconsistency of the
system  of  p'-equations  which  physically  is  the  global  mass  deficit.  The  convergence 
rate  can  be  improved  dramatically  by  explicitly  enforcing  mass  conservation  using 
an  additive-correction  technique.  The  results  of  numerical  simulations  of  backward- 
facing  step  flow  illustrate  and  support  these  conclusions. 

The  mass-conservation  constraint  also  has  implications  for  the  issue  of  proper 
numerical  treatment  of  open  boundaries  where  inflow  is  occurring.  Specifically,  the 
conventional viewpoint that inflow cannot occur at open boundaries without Dirichlet
prescription of the inflow variables can be rebutted, based on the grounds that
the  numerical  problem  is  well-posed  if  the  normal  velocity  components  satisfy  the 
continuity constraint.

Figure 2.1. Staggered grid u control volume and the nearby variables which are
involved in the discretization of the u-momentum equation.

Figure 2.2. Description of two model problems. Both are at Re = 100. The cavity
is a square with a top wall sliding to the left, while the backward-facing step is a
4 x 1 rectangular domain with an expansion ratio H/h = 2, and a parabolic inflow
(average inflow velocity = 1). The cavity flow grid is 9 x 9 and the step flow grid is
21 x 9. The meshes and the velocity vectors are shown.

Figure 2.3. Model 3 x 3 computational domain with numbered control volumes, for
discussion of Eq. 2.22. The staggered velocity components which refer to control
volume 5 are also indicated.

Figure 2.4. Outer-loop convergence paths for the Re = 100 lid-driven cavity and
backward-facing step flows, using central-differencing for the convection terms. Legend:
p' equation; u momentum equation; v momentum equation (-.-.-.-).

Figure 2.5. Inner-loop convergence paths for the Re = 100 lid-driven cavity and
backward-facing step flows. The vertical axis is the log10 of the ratio of the current
residual to the initial residual. Legend: p' equation; u momentum equation;
v momentum equation (-.-.-.-).

Figure 2.6. Variation of p' with inner-loop iterations. The dashed line is the value
of p' at the lower-left control volume, while the solid line is the difference between

Figure 2.7. Outer-loop and inner-loop convergence paths of the p' equation for the
backward-facing step model problem, with and without enforcing the continuity constraint.
(1) conservation of mass not enforced; (2) continuity enforced globally; (3)
cavity flow.

Figure 2.9. Time-dependent flowfield for impulsively started backward-facing step
flow, Re = 800. The domain has length L = 8. Streamfunction contours are plotted
at several instants during the evolution to the steady-state, which is the last figure.

Figure 2.10. Time-dependent location of bottom wall reattachment point and top wall
separation point for Re = 800 impulsively started backward-facing step flow. The
curves for both L = 8 and L = 16 computations are shown; they overlap identically.

Figure 2.11. Comparison of u and v-component of velocity profiles at x = 7.0 for
the L = 16 and L = 8 backward-facing step simulations at Re = 800, with central-differencing.
(o) indicates the grid-independent FEM solution obtained by Gartling.
The v profile is scaled up by 10"'.


CHAPTER  3 
EFFICIENCY  AND  SCALABILITY  ON  SIMD  COMPUTERS 

The previous chapter considered an issue which was important because of its implications
for the convergence rate in open boundary problems. The present chapter
shifts  gears  to  focus  on  the  cost  and  efficiency  of  pressure-correction  methods  on 
SIMD  computers. 

As  discussed  in  Chapter  1,  the  eventual  goal  is  to  understand  the  indirect  cost  [23], 
i.e.  the  parallel  run  time,  of  such  methods  on  SIMD  computers,  and  how  this  cost 
scales  with  the  problem  size  and  the  number  of  processors.  The  run  time  is  just  the 
number  of  iterations  multiplied  by  the  cost  per  iteration.  This  chapter  considers  the 
cost  per  iteration. 

3.1     Background 

The  discussion  of  SIMD  computers  in  Chapter  1  indicated  similarities  in  the 
general  layout  of  such  machines  and  in  the  factors  which  affect  program  performance. 
More  detail  is  given  in  this  section  to  better  support  the  discussion  of  results. 

3.1.1     Speedup  and  Efficiency 

Speedup  S  is  defined  as 

$$S = \frac{T_1}{T_p}, \tag{3.1}$$

where $T_p$ is the measured run time using $n_p$ processors. In the present work $T_1$ is
the run time of the parallel algorithm on one processor, including both serial and
parallel computational work, but excluding the front-end-to-processor and interprocessor
communication. On a MIMD machine it is sometimes possible to actually time
the program on one processor, but each SIMD processor is not usually a capable
serial computer by itself, so $T_1$ must be estimated. The timing tools on the CM-2
and CM-5 are very sophisticated, and can separately measure the time elapsed by
the processors doing computation, doing various kinds of communication, and doing
nothing (waiting for an instruction from the front-end, which might be finishing up
some serial work before it can send another code block). Thus, it is possible to make
a reasonable estimate for $T_1$.

Parallel efficiency is the ratio of the actual speedup to the ideal ($n_p$), which reflects
the overhead costs of doing the computation in parallel:

$$E = \frac{S_{\text{actual}}}{S_{\text{ideal}}} = \frac{T_1}{n_p T_p}. \tag{3.2}$$

If $T_{\text{comp}}$ is the time in seconds spent by each of the $n_p$ processors doing useful work
(computation), $T_{\text{inter-proc}}$ is the time spent by the processors doing interprocessor
communication, and $T_{\text{fe-to-proc}}$ is the time elapsed through front-end-to-processor
communication, then each of the processors is busy a total of $T_{\text{comp}} + T_{\text{inter-proc}}$
seconds and the total run time on multiple processors is $T_{\text{comp}} + T_{\text{inter-proc}} + T_{\text{fe-to-proc}}$
seconds. Assuming that the parallelism is high, i.e. a high percentage of the virtual
processors are not idle, a single processor would need $n_p T_{\text{comp}}$ time to do the same
work. Thus, $T_1 = n_p T_{\text{comp}}$, and from Eq. 3.2 $E$ can be expressed as

$$E = \frac{1}{1 + (T_{\text{inter-proc}} + T_{\text{fe-to-proc}})/T_{\text{comp}}} = \frac{1}{1 + T_{\text{comm}}/T_{\text{comp}}}. \tag{3.3}$$

Since time is work divided by speed, E depends on both machine-related factors and
the implementational factors through Eq. 3.3. High parallel efficiency is not necessarily
a product of fast processors or fast communications considered alone; instead, it
is the relative speeds that are important, and the relative amount of communication
and computation in the program. Consider the machine-related factors first.
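
The speedup and efficiency definitions of this section can be evaluated with a few lines of Python. This is only a sketch: the function names are hypothetical, the timings are made-up values rather than measurements from this study, and the estimate of the one-processor time as the processor count times the per-processor computation time follows the high-parallelism assumption stated above.

```python
def parallel_efficiency(t_comp, t_inter_proc, t_fe_to_proc):
    """Efficiency from per-processor computation and communication times."""
    t_comm = t_inter_proc + t_fe_to_proc
    return 1.0 / (1.0 + t_comm / t_comp)

def speedup(t_comp, t_inter_proc, t_fe_to_proc, n_p):
    """Speedup T_1 / T_p with T_1 estimated as n_p * t_comp."""
    t_1 = n_p * t_comp
    t_p = t_comp + t_inter_proc + t_fe_to_proc
    return t_1 / t_p

# Hypothetical timings in seconds, chosen only for illustration.
t_comp, t_inter_proc, t_fe_to_proc, n_p = 8.0, 1.5, 0.5, 128

E = parallel_efficiency(t_comp, t_inter_proc, t_fe_to_proc)
S = speedup(t_comp, t_inter_proc, t_fe_to_proc, n_p)
print(E, S, S / n_p)
```

With these sample numbers the communication time is one quarter of the computation time, so the efficiency is 0.8 regardless of how fast the processors are in absolute terms.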

3.1.2     Comparison Between CM-2, CM-5, and MP-1

A 32-node CM-5 with vector units, a 16k-processor CM-2, and a 1k-processor
MP-1 were used in the present study. The CM-5 has 4 GBytes total memory, while
the CM-2 has 512 MBytes, and the MP-1 has 64 MBytes. The peak speeds of these
computers are 4, 3.5, and 0.034 Gflops, respectively, in double precision. Per processor,
the peak speeds are 32, 7, and 0.033 Mflops, with memory bandwidths of 128,
25, and 0.67 MBytes/s [67, 83]. Clearly these are computers with very different capabilities,
even taking into account the fact that peak speeds, which are based only on
the processor speed under ideal conditions, are not an accurate basis for comparison.

In  the  CM-2  and  CM-5  the  front-end  computers  are  Sun-4  workstations,  while 
in the MP-1 the front-end is a DECstation 5000. From Eq. 3.3, it is clear that the
relative  speeds  of  the  front-end  computer  and  the  processors  are  important.  Their 
ratio  determines  the  importance  of  the  front-end-to-processor  type  of  communication. 
On  the  CM-2  and  MP-1,  there  is  just  one  of  these  intermediate  processors,  called 
either  a  sequencer  or  an  array  control  unit,  respectively,  while  on  the  32-node  CM-5 
the  32  SPARC  microprocessors  have  the  role  of  sequencers. 

Each SPARC node broadcasts to four vector units (VUs) which actually do the
work. Thus a 32-node CM-5 has 128 independent processors. In the CM-2 the "processors"
are more often called processing elements (PEs), because each one consists of
a floating-point unit coupled with 32 bit-serial processors. Each bit-serial processor
is the memory manager for a single bit of a 32-bit word. Thus, the 16k-processor
CM-2 actually has only 512 independent processing elements. This strange CM-2
processor design came about basically as a workaround which was introduced to improve
the memory bandwidth for floating-point calculations [66]. Compared to the
CM-5 VUs, the CM-2 processors are about one-fourth as fast, with larger overhead
costs associated with memory access and computation. The MP-1 has 1024 4-bit
processors; compared to either the CM-5 or CM-2 processors, the MP-1 processors
are very slow. The generic term "processing element" (PE), which is used occasionally
in the discussion below, refers to either one of the VUs, one of the 512 CM-2
processors, or one of the MP-1 processors, whichever is appropriate.

For  the  present  study,  the  processors  are  either  physically  or  logically  imagined 
to  be  arranged  as  a  2-d  mesh,  which  is  a  layout  that  is  well-supported  by  the  data 
networks  of  each  of  the  computers.  The  data  network  of  the  32-node  CM-5  is  a 
fat  tree  of  height  3,  which  is  similar  to  a  binary  tree  except  the  bandwidth  stays 
constant  upwards  from  height  2  at  160  MBytes/s  (details  in  [83]).  One  can  expect 
approximately  480  MBytes/s  for  regular  grid  communication  patterns  (i.e.  between 
nearest-neighbor  SPARC  nodes)  and  128  MBytes/s  for  random  (global)  communica- 
tions. The  randomly-directed  messages  have  to  go  farther  up  the  tree,  so  they  are 
slower.  The  CM-2  network  (a  hypercube)  is  completely  different  from  the  fat-tree  net- 
work and  its  performance  for  regular  grid  communication  between  nearest-neighbor 
processors  is  roughly  350  MBytes/s  [67].  The  grid  network  on  the  CM-2  is  called 
NEWS (North-East-West-South). It is a subset of the hypercube connections selected
at run time. The MP-1 has two networks: regular communications use X-Net
(1.25  GBytes/s,  peak)  which  connects  each  processor  to  its  eight  nearest  neighbors, 
and  random  communications  use  a  3-stage  crossbar  (80  MBytes/s,  peak). 

To  summarize  the  relative  speeds  of  these  three  SIMD  computers  it  is  sufficient 
for  the  present  study  to  observe  that  the  MP-1  has  very  fast  nearest-neighbor  com- 
munication compared  to  its  computational  speed,  while  the  exact  opposite  is  true  for 
the  CM-2.  The  ratio  of  nearest-neighbor  communication  speed  to  computation  speed 
is  smaller  still  for  the  CM-5  than  the  CM-2.  Again,  from  Eq.  3.3,  one  expects  that 
these  differences  will  be  an  important  factor  influencing  the  parallel  efficiency. 

3.1.3 Hierarchical and Cut-and-Stack Data Mappings

When  there  are  more  array  elements  (grid  points)  than  processors,  each  processor 
handles  multiple  grid  points.  Which  grid  points  are  assigned  to  which  processors  is 
determined by the "data-mapping," also called the data layout. Each processor repeats
any instruction the appropriate number of times to handle all the array elements
which have been assigned to it. A useful idealization for SIMD machines, however,
is to pretend there are always as many processors as grid points. Then one speaks of
the "virtual processor" ratio (VP), which is the number of array elements assigned to
each physical processor. The way the data arrays are partitioned and mapped to the
processors  is  a  main  concern  for  developing  a  parallel  implementation.  The  layout  of 
the  data  determines  the  amount  of  communication  in  a  given  program. 

When  the  virtual  processor  ratio  is  1,  there  are  an  equal  number  of  processors 
and  array  elements  and  the  mapping  is  just  one-to-one.  When  VP  >  1  the  mapping 
of  data  to  processors  is  either  "hierarchical,"  in  CM-Fortran,  or  "cut-and-stack"  in 
MP-Fortran.  These  mappings  are  also  termed  "block"  and  "cyclic"  [85],  respectively, 
in  the  emerging  High-Performance  Fortran  standard.  The  relative  merits  of  these 
different  approaches  have  not  been  completely  explored  yet. 

In  cut-and-stack  mapping,  nearest-neighbor  array  elements  are  mapped  to  nearest- 
neighbor  physical  processors.  When  the  number  of  array  elements  exceeds  the  num- 
ber of  processors,  additional  memory  layers  are  created.  VP  is  just  the  number  of 
memory  layers.  In  the  general  case,  nearest-neighbor  virtual  processors  (i.e.  array 
elements)  will  not  be  mapped  to  the  same  physical  processor.  Thus,  the  cost  of  a 
nearest-neighbor  communication  of  distance  one  will  be  proportional  to  VP,  since  the 
nearest-neighbors  of  each  virtual  processor  will  be  on  a  different  physical  processor. 
In  the  hierarchical  mapping,  contiguous  pieces  of  an  array  ("virtual  subgrids")  are 
mapped to each processor. The "subgrid size" for the hierarchical mapping is synonymous
with VP. The distinction between hierarchical and cut-and-stack mapping
is  clarified  by  Figure  3.1. 
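
The difference between the two mappings can be illustrated for a one-dimensional array. The following is a minimal Python sketch, not CM-Fortran or MP-Fortran; the function names are hypothetical, and the array length is assumed to be an exact multiple of the number of processors.

```python
def block_owner(i, n, p):
    """Hierarchical (block) mapping: each processor holds one contiguous piece."""
    vp = n // p          # virtual processor ratio (subgrid size)
    return i // vp       # physical processor that holds element i


def cyclic_owner(i, n, p):
    """Cut-and-stack (cyclic) mapping: element i goes to processor i mod p,
    in memory layer i // p."""
    return i % p


def offproc_neighbor_fraction(owner, n, p):
    """Fraction of elements whose right-hand neighbor is on another processor."""
    off = sum(1 for i in range(n - 1) if owner(i, n, p) != owner(i + 1, n, p))
    return off / (n - 1)


n, p = 16, 4   # 16 array elements on 4 processors, so VP = 4
block_layout = [block_owner(i, n, p) for i in range(n)]
cyclic_layout = [cyclic_owner(i, n, p) for i in range(n)]
```

In this sketch every nearest-neighbor pair is split across processors under the cyclic mapping, while under the block mapping only the pairs that straddle a subgrid edge are split.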

In  hierarchical  mapping,  for  VP  >  1,  each  virtual  processor  has  nearest-neighbors 
in  the  same  virtual  subgrid,  that  is,  on  the  same  physical  processor.  Thus,  for  hier- 
archical mapping  on  the  CM-2,  interprocessor  communication  breaks  down  into  two 
types  (with  different  speeds) — on-processor  and  off-processor.  Off-processor  commu- 
nication on  the  CM-2  has  the  NEWS  speed  given  above,  while  on-processor  communi- 
cation is  somewhat  faster,  because  it  is  essentially  just  a  memory  operation.  A  more 
detailed  presentation  and  modelling  of  nearest-neighbor  communication  costs  for  the 
hierarchical  mapping  on  the  CM-2  is  given  in  [3].  The  key  idea  is  that  with  hierar- 
chical mapping  on  the  CM-2  the  relative  amount  of  on-processor  and  off-processor 
communication  is  the  area  to  perimeter  ratio  of  the  virtual  subgrid. 
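
The area-to-perimeter argument can be made concrete by counting element transfers. The sketch below is an illustration only, with a hypothetical function name; it assumes that every element of a rows x cols virtual subgrid is shifted once in each of the four NEWS directions, and that only the edge elements facing the shift direction leave the processor.

```python
def shift_traffic(rows, cols):
    """Count element transfers for one shift in each of the four NEWS
    directions on a rows x cols virtual subgrid.

    Returns (on_processor, off_processor).
    """
    area = rows * cols
    perimeter = 2 * (rows + cols)   # one edge of elements leaves per direction
    total = 4 * area                # every element moves once per direction
    return total - perimeter, perimeter


on_square, off_square = shift_traffic(32, 32)    # square subgrid, VP = 1024
on_slice, off_slice = shift_traffic(128, 8)      # elongated subgrid, VP = 1024
```

For the same VP, the elongated subgrid sends roughly twice as many elements off-processor as the square one in this count.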

For  the  CM-5,  there  are  three  types  of  interprocessor  communication:  (1)  between 
virtual  processors  on  the  same  processor  (that  is,  the  same  VU),  (2)  between  virtual 
processors  on  different  VUs  but  on  the  same  SPARC  node,  and  (3)  between  virtual 
processors on different SPARC nodes. Between different SPARC nodes (type 3),
the speed is 480 MBytes/s as mentioned above. On the same VU (type 1) the speed is 16
GBytes/s.  (The  latter  number  is  just  the  aggregate  memory  bandwidth  of  the  32- 
node  CM-5.)  Thus,  although  off-processor  NEWS  communication  is  slow  compared 
to  computation  on  the  CM-2  and  CM-5,  good  efficiencies  can  still  be  achieved  as  a 
consequence  of  the  data  mapping  which  allows  the  majority  of  communication  to  be 
of  the  on-processor  type. 

3.2 Implementational Considerations

The  cost  per  SIMPLE  iteration  depends  on  the  choice  of  relaxation  method 
(solver) for the systems of equations, the number of inner iterations ($\nu_u$, $\nu_v$, and $\nu_c$),
the  computation  of  coefficients  for  each  system  of  equations,  the  correction  step,  and 
the  convergence  checking  and  serial  work  done  in  program  control.  The  pressure- 
correction  equation,  since  it  is  not  underrelaxed,  typically  needs  to  be  given  more 
iterations  than  the  momentum  equations,  and  consequently  most  of  the  effort  is  ex- 
pended during  this  step  of  the  SIMPLE  method.  This  is  another  reason  why  the 
convergence  rate  of  the  p'-equations  discussed  in  Chapter  2  is  important.  Typically 
$\nu_u$ and $\nu_v$ are the same and are $\le 3$, and $\nu_c \le 5\nu_u$.

In developing a parallel implementation of the SIMPLE algorithm, the first consideration
is the method of solving the u, v, and p' systems of equations. For serial
computations, successive line-underrelaxation using the tridiagonal matrix algorithm
(TDMA, whose operation count is $O(N)$) is a good choice because the cost per iteration
is optimal and there is long-distance coupling between flow variables (along
lines), which is effective in promoting convergence in the outer iterations. The TDMA
is intrinsically serial. For parallel computations, a parallel tridiagonal solver must be
used (parallel cyclic reduction in the present work). In this case the cost per iteration
depends not only on the computational workload ($O(N \log_2 N)$) but also on
the  amount  of  communication  generated  by  the  implementation  on  a  particular  ma- 
chine. For  these  reasons,  timing  comparisons  are  made  for  several  implementations 
of  both  point-  and  line-Jacobi  solvers  used  during  the  inner  iterations  of  the  SIMPLE 
algorithm. 
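
To show why the TDMA is intrinsically serial, a minimal Python sketch of the standard Thomas algorithm follows. It is a generic textbook form, not the code used in this work; the argument names are chosen for illustration. Each pass of either loop needs the result of the previous pass, so the N unknowns along a line cannot be processed simultaneously.

```python
def tdma(a, b, c, d):
    """Solve a tridiagonal system with the Thomas algorithm (TDMA).

    a: sub-diagonal (a[0] is ignored), b: diagonal,
    c: super-diagonal (c[-1] is ignored), d: right-hand side.
    Returns the solution as a list. Operation count is O(N).
    """
    n = len(d)
    cp = [0.0] * n
    dp = [0.0] * n
    cp[0] = c[0] / b[0]
    dp[0] = d[0] / b[0]
    # Forward elimination: step i depends on step i - 1
    for i in range(1, n):
        denom = b[i] - a[i] * cp[i - 1]
        cp[i] = c[i] / denom
        dp[i] = (d[i] - a[i] * dp[i - 1]) / denom
    # Back substitution: step i depends on step i + 1
    x = [0.0] * n
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x


# Small diagonally dominant test system whose exact solution is [1, 2, 3, 4]
a = [0.0, -1.0, -1.0, -1.0]
b = [4.0, 4.0, 4.0, 4.0]
c = [-1.0, -1.0, -1.0, 0.0]
d = [2.0, 4.0, 6.0, 13.0]
x = tdma(a, b, c, d)
```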

Generally, point-Jacobi iteration is not sufficiently effective for complex flow problems.
However, as part of a multigrid strategy, good convergence rates can be obtained
(see Chapters 4 and 5). Furthermore, because it only involves the fastest type
of  interprocessor  communication,  that  which  occurs  between  nearest-neighbor  pro- 
cessors, point-Jacobi  iteration  provides  an  upper  bound  for  parallel  efficiency,  against 
which  other  solvers  can  be  compared. 

The  second  consideration  is  the  treatment  of  boundary  computations.  In  the 
present  implementation,  the  coefficients  and  source  terms  for  the  boundary  control 
volumes  are  computed  using  the  interior  control  volume  formula  and  mask  arrays. 
Oran  et  al.  [57]  have  called  this  trick  the  uniform  boundary  condition  approach. 
All  coefficients  can  be  computed  simultaneously.  The  problem  with  computing  the 
boundary  coefficients  separately  is  that  some  of  the  processors  are  idle,  which  de- 
creases E.  For  the  CM-5,  which  is  "synchronized  MIMD"  instead  of  strictly  SIMD, 
there  exists  limited  capability  to  handle  both  boundary  and  interior  coefficients  si- 
multaneously without  formulating  a  single  all-inclusive  expression.  However,  this 
capability  cannot  be  utilized  if  either  the  boundary  or  interior  formulas  involve  in- 
terprocessor communication,  which  is  the  case  here.  As  an  example  of  the  uniform 
approach,  consider  the  source  terms  for  the  north  boundary  u  control  volumes,  which 
are  computed  by  the  formula 

\[ b = a_N u_N + (p_w - p_e)\,\Delta y \tag{3.4} \]

Recall that $a_N$ represents the discretized convective and diffusive flux terms, $u_N$
is the boundary value, and, in the pressure gradient term, $\Delta y$ is the vertical dimension
of the u control volume and $p_w$/$p_e$ are the west/east u-control-volume face pressures
on the staggered grid. Similar modifications show up in the south, east, and west
boundary  u  control  volume  source  terms.    To  compute  the  boundary  and  interior 
source terms simultaneously, the following implementation is used:

\[ b = a_{boundary}\, u_{boundary} + (p_w - p_e)\,\Delta y \tag{3.5} \]

where

\[ u_{boundary} = u_N I_N + u_S I_S + u_E I_E + u_W I_W \tag{3.6} \]

and

\[ a_{boundary} = a_N I_N + a_S I_S + a_E I_E + a_W I_W \tag{3.7} \]

$I_N$, $I_S$, $I_E$, and $I_W$ are the mask arrays, which have the value 1 for the respective
boundary  control  volumes  and  0  everywhere  else.  They  are  initialized  once,  at  the 
beginning  of  the  program.  Then,  every  iteration,  there  are  four  extra  nearest-neighbor 
communications.  A  comparison  of  the  uniform  approach  with  an  implementation  that 
treats  each  boundary  separately  is  discussed  in  the  results. 
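
The uniform approach of Eqs. 3.5 to 3.7 can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the CM Fortran or MP-Fortran code timed here: the helper names, the 4 x 4 grid, and all numerical values are hypothetical, row index ny - 1 is taken as the north boundary, and the explicit loops stand in for what would be a single data-parallel array expression on a SIMD machine. Corner control volumes simply belong to two masks in this sketch and are not examined.

```python
def full(ny, nx, value):
    """ny x nx array filled with a constant."""
    return [[value] * nx for _ in range(ny)]


def make_mask(ny, nx, predicate):
    """ny x nx array of 0/1 flags, 1 where predicate(j, i) is true."""
    return [[1 if predicate(j, i) else 0 for i in range(nx)] for j in range(ny)]


def uniform_source(a, u_bc, masks, p_w, p_e, dy):
    """Evaluate b = a_boundary * u_boundary + (p_w - p_e) * dy at every
    control volume with one expression (Eqs. 3.5-3.7). a, u_bc, and masks
    are dicts keyed by 'N', 'S', 'E', 'W'; every array is ny x nx."""
    ny, nx = len(p_w), len(p_w[0])
    b = full(ny, nx, 0.0)
    for j in range(ny):
        for i in range(nx):
            u_boundary = sum(u_bc[k][j][i] * masks[k][j][i] for k in "NSEW")
            a_boundary = sum(a[k][j][i] * masks[k][j][i] for k in "NSEW")
            b[j][i] = a_boundary * u_boundary + (p_w[j][i] - p_e[j][i]) * dy
    return b


ny = nx = 4
masks = {
    "N": make_mask(ny, nx, lambda j, i: j == ny - 1),
    "S": make_mask(ny, nx, lambda j, i: j == 0),
    "E": make_mask(ny, nx, lambda j, i: i == nx - 1),
    "W": make_mask(ny, nx, lambda j, i: i == 0),
}
a = {k: full(ny, nx, 2.0) for k in "NSEW"}           # illustrative coefficients
u_bc = {k: full(ny, nx, 0.0) for k in "NSEW"}
u_bc["N"] = full(ny, nx, 1.0)                        # moving lid on the north wall
b = uniform_source(a, u_bc, masks, full(ny, nx, 3.0), full(ny, nx, 1.0), 0.5)
```

Interior control volumes pick up only the pressure term because all four masks are zero there, while the north-boundary control volumes also receive the $a_N u_N$ contribution.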

3.3     Numerical  Experiments 

The  SIMPLE  algorithm  for  two-dimensional  laminar  flow  has  been  timed  on  a 
range of problem sizes from 8 x 8 to 1024 x 1024 which, on the CM-5, covers up
to VP = 8192. The convection terms are central-differenced. A fixed number (100)
of  outer  iterations  are  timed  using  as  a  model  flow  problem  the  lid-driven  cavity 
flow  at  Re  =  1000.  The  timings  were  made  with  the  "Prism"  timing  utility  on 
the  CM-2  and  CM-5,  and  the  "dpuTimer"  routines  on  the  MP-1  [52,  86].  These 
utilities  can  be  inaccurate  if  the  front-end  machine  is  heavily  loaded,  which  was  the 
case  with  the  CM-2.  Thus,  on  the  CM-2  all  cases  were  timed  three  times  and  the 
fastest  times  were  used,  as  recommended  by  Thinking  Machines  [82].  Prism  times 
every  code  block  and  accumulates  totals  in  several  categories,  including  computation 
time for the nodes (Tcomp), "NEWS" communication (Tnews), and irregular-pattern
"SEND" communication. Also it is possible to infer Tfe-to-proc from the difference
between the processor busy time and the elapsed time. In the results Tcomm is the
sum of the "NEWS" and "SEND" interprocessor times. The front-end-to-processor
communication is separate. Additionally, the component tasks of the algorithm have
been timed, namely the coefficient computations (Tcoeff), the solver (Tsolve), and the
velocity-correction and convergence-checking parts.

3.3.1 Efficiency of Point and Line Solvers for the Inner Iterations

Figure  3.2,  based  on  timings  made  on  the  CM-5,  illustrates  the  difference  in 
parallel  efficiency  for  SIMPLE  using  point-Jacobi  and  line-Jacobi  iterative  solvers.  E 
is computed from Eq. 3.3 by timing Tcomm and Tcomp introduced above. Problem size
is  given  in  terms  of  the  virtual  processor  ratio  VP  previously  defined. 

There are two implementations, each with a different data layout, for point-Jacobi
iteration.  One  ignores  the  distinction  between  virtual  processors  which  are  on  the 
same  physical  processor  and  those  which  are  on  different  physical  processors.  Each 
array  element  is  treated  as  if  it  is  a  processor.  Thus,  interprocessor  communication 
is  generated  whenever  data  is  to  be  moved,  even  if  the  two  virtual  processors  do- 
ing the  communication  happen  to  be  on  the  same  physical  processor.  To  be  more 
precise,  a  call  to  the  run-time  communication  library  is  generated  for  every  array  el- 
ement. Then,  those  array  elements  (virtual  processors)  which  actually  reside  on  the 
same  physical  processor  are  identified  and  the  communication  is  done  as  a  memory 
operation — but  the  unnecessary  overhead  of  calling  the  library  is  incurred.  Obviously 
there  is  an  inefficiency  associated  with  pretending  that  there  are  as  many  processors 
as array elements, but the tradeoff is that this is the most straightforward, and indeed
the intended, way to do the programming. In Figure 3.2, this approach is labelled
"NEWS," with the symbol "o." The other implementation is labelled "on-VU," with
the symbol "+," to indicate that interprocessor communication between virtual processors
on the same physical processor is being eliminated; the programming is in a
sense being done "on-VU."

To  indicate  to  the  compiler  the  different  layouts  of  the  data  which  are  needed, 
the  programmer  inserts  compiler  directives.  For  the  "NEWS"  version,  the  arrays  are 
laid out as shown in this example for a 1k x 1k grid and an 8 x 16 processor layout
on  the  CM-5: 

      REAL*8 A(1024,1024)
CMF$  LAYOUT A(:BLOCK=128 :PROCS=8, :BLOCK=64 :PROCS=16)

Thus, the subgrid shape is 128 x 64, with a subgrid size (VP) of 8192 (this happens
to be the biggest problem size for my program on a 32-node CM-5 with 4 GBytes
of memory). When shifting all the data to their east nearest-neighbor, for example,
the large majority of transfers are on-VU and could be done without real interprocessor
communication. But there are only 2 dimensions in A, so that data-parallel
program  statements  cannot  specifically  access  certain  array  elements,  i.e.  the  ones  on 
the  perimeter  of  the  subgrid.  Thus  it  is  not  possible  with  the  "NEWS"  layout  to 
treat  interior  virtual  processors  differently  from  those  on  the  perimeter,  and  conse- 
quently data  shifts  between  the  interior  virtual  processors  generate  interprocessor 
communication  even  though  it  is  unnecessary. 

In  the  "on-VU"  version,  a  different  data  layout  is  used  which  makes  explicit  to  the 
compiler  the  boundary  between  physical  processors.  The  arrays  are  laid  out  without 
virtual  processors: 

CMF$  LAYOUT A(:SERIAL, :SERIAL, :BLOCK=1 :PROCS=8, :BLOCK=1 :PROCS=16)

The  declaration  must  be  changed  accordingly,  to  A(128,64,8,16).  Normally  it  is 
inconvenient  to  work  with  the  arrays  in  this  manner.  Thus  the  approach  taken  here 
is to use an "array alias" of A [84]. In other words, this is an EQUIVALENCE function
for the data-parallel arrays (similar to the Fortran77 EQUIVALENCE concept),
which  equates  A(1024,1024)  with  A(128,64,8,16),  with  the  different  LAYOUTs  given 
above.  It  is  the  alias  instead  of  the  original  A  which  is  used  in  the  on-VU  point- 
Jacobi  implementation.  In  the  solver,  the  "on-VU"  layout  is  used;  everywhere  else, 
the  more  convenient  "NEWS"  layout  is  used.  The  actual  mechanism  by  which  the 
equivalencing  of  distributed  arrays  can  be  accomplished  is  not  too  difficult  to  under- 
stand. The  front-end  computer  stores  "array  descriptors,"  which  contain  the  array 
layout,  the  starting  address  in  processor  memory,  and  other  information.  The  actual 
layout in each processor's memory is linear and doesn't change, but multiple array
descriptors  can  be  generated  for  the  same  data.  This  descriptor  multiplicity  is  what 
array  aliasing  accomplishes.  With  the  "on-VU"  programming  style,  the  compiler 
does  not  generate  communication  when  the  shift  of  data  is  along  a  SERIAL  axis. 
Thus,  interprocessor  communication  is  generated  only  when  the  virtual  processors 
involved  are  on  different  physical  processors,  i.e.  only  when  it  is  truly  necessary.  The 
difference  in  the  amount  of  communication  is  substantial  for  large  subgrid  sizes. 

For  both  the  "NEWS"  and  the  "on-VU"  curves  in  Figure  3.2,  E  is  initially  very 
low, but as VP increases, E rises until it reaches a peak value of about 0.8 for the
"NEWS" version and 0.85 for the "on-VU" version. The trend is due to the amortization
of the front-end-to-processor and off-VU (between VUs which are physically
under control of different SPARC nodes) communication. The former contributes a
constant overhead cost per Jacobi iteration to Tcomm, while the latter has a $VP^{1/2}$
dependency [3]. However, it does not appear from Figure 3.2 that these two terms'
effects  can  be  distinguished  from  one  another. 

For  VP  >  2k,  the  CM-5  is  computing  roughly  3/4  of  the  time  for  the  implementa- 
tion which  uses  the  "NEWS"  version  of  point-Jacobi,  with  the  remainder  split  evenly 
between front-end-to-processor communication and on-VU interprocessor communication.
It appears that the "on-VU" version has more front-end-to-processor communication
per iteration, so there is, in effect, a price of more front-end-to-processor
communication to pay in exchange for less interprocessor communication. Consequently
it takes VP > 4k to reach peak efficiency instead of 2k with the "NEWS"
version. For VP > 4k, however, E is about 5% to 10% higher than for the "NEWS"
version  because  the  on-VU  communication  has  been  replaced  by  straight  memory 
operations. 

The  observed  difference  would  be  even  greater  if  a  larger  part  of  the  total  parallel 
run  time  was  spent  in  the  solver.  For  the  large  VP  cases  in  Figure  3.2,  approximately 
equal  time  was  spent  computing  coefficients  and  solving  the  systems  of  equations. 
"Typical"  numbers  of  inner  iterations  were  used,  3  each  for  the  u  and  v  momentum 
equations,  and  9  for  the  p'  equation.  From  Figure  3.2,  then,  it  appears  that  the  ad- 
vantage of  the  "on-VU"  version  over  the  "NEWS"  version  of  point-Jacobi  relaxation 
within  the  SIMPLE  algorithm  is  around  0.1  in  E,  for  large  problem  sizes. 

Red/black  analogues  to  the  "NEWS"  and  "on-VU"  versions  of  point-Jacobi  iter- 
ation have  also  been  tested.  Red/black  point  iteration  done  in  the  "on-VU"  manner 
does  not  generate  any  additional  front-end-to-processor  communication,  and  there- 
fore takes  almost  an  identical  amount  of  time  as  point-Jacobi.  Thus  red/black  point 
iterations  are  recommended  when  the  "on-VU"  layout  is  used  due  to  their  improved 
convergence rate. However, with the "NEWS" layout, red/black point iteration generates
two code blocks instead of one, and reduces by a factor of 2 the amount of computation
per code block. This results in a substantial (about 35% for the VP = 8k case) increase
in run time. Thus, if using "NEWS" layouts, red/black point iteration is not
cost-effective. 

There are also two implementations of line-Jacobi iteration. In both, one inner
iteration  consists  of  forming  a  tridiagonal  system  of  equations  for  the  unknowns  in 
each  vertical  line  by  moving  the  east/west  terms  to  the  right-hand  side,  solving  the 
multiple  systems  of  equations  simultaneously,  and  repeating  the  procedure  for  the 
horizontal  lines. 

In  the  first  version,  parallel  cyclic  reduction  is  used  to  solve  the  multiple  tridiag- 
onal systems  of  equations  (see  [44]  for  a  clear  presentation).  This  involves  combining 
equations  to  decouple  the  system  into  even  and  odd  equations.  The  result  is  two 
tridiagonal  systems  of  equations  each  half  the  size  of  the  original.  The  reduction  step 
is repeated $\log_2 N$ times, where N is the number of unknowns in each line. Thus, the
computational operation count is $O(N \log_2 N)$. Interprocessor communication occurs
for every unknown for every step, thus the communication operation count is also
$O(N \log_2 N)$. However, the distance for communication increases every step of the reduction
by a factor of 2. For the first step, nearest-neighbor communication occurs,
while for the second step, the distance is 2, then 4, etc. Thus, the net communication
speed is slower than the nearest-neighbor type of communication. Figure 3.2
confirms this argument: E peaks at about 0.5 compared to 0.8 for point-Jacobi iteration.
In other words, for VP > 4k, interprocessor communication takes as much
time  as  computation  with  the  line-Jacobi  solver  using  cyclic  reduction. 
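
The reduction procedure described above can be sketched as follows. This is a serial Python emulation of a generic textbook parallel cyclic reduction, not the CM Fortran implementation timed in this work, and the function and variable names are hypothetical. The inner loop over i stands for one data-parallel step in which every equation is updated at once; the stride plays the role of the communication distance, which doubles each step.

```python
def pcr_solve(a, b, c, d):
    """Solve a tridiagonal system by parallel cyclic reduction (PCR).

    a: sub-diagonal with a[0] == 0, b: diagonal,
    c: super-diagonal with c[-1] == 0, d: right-hand side.
    """
    n = len(d)
    a, b, c, d = list(a), list(b), list(c), list(d)
    stride = 1
    while stride < n:                      # about log2(n) reduction steps
        na, nb, nc, nd = [0.0] * n, list(b), [0.0] * n, list(d)
        # On a SIMD machine this loop is a single data-parallel step;
        # equation i needs data from equations i - stride and i + stride.
        for i in range(n):
            lo, hi = i - stride, i + stride
            if lo >= 0:
                alpha = -a[i] / b[lo]      # eliminates the unknown at lo
                na[i] = alpha * a[lo]
                nb[i] += alpha * c[lo]
                nd[i] += alpha * d[lo]
            if hi < n:
                gamma = -c[i] / b[hi]      # eliminates the unknown at hi
                nc[i] = gamma * c[hi]
                nb[i] += gamma * a[hi]
                nd[i] += gamma * d[hi]
        a, b, c, d = na, nb, nc, nd
        stride *= 2
    # Every equation is now decoupled: b[i] * x[i] = d[i]
    return [d[i] / b[i] for i in range(n)]


# Diagonally dominant test system with a known solution
n = 8
a = [0.0] + [-1.0] * (n - 1)
b = [4.0] * n
c = [-1.0] * (n - 1) + [0.0]
x_exact = [float(i + 1) for i in range(n)]
d = [
    (a[i] * x_exact[i - 1] if i > 0 else 0.0)
    + b[i] * x_exact[i]
    + (c[i] * x_exact[i + 1] if i < n - 1 else 0.0)
    for i in range(n)
]
x = pcr_solve(a, b, c, d)
```

Each step touches all N equations, which is where the $O(N \log_2 N)$ work and communication counts come from, in contrast to the $O(N)$ count of the serial TDMA.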

In  the  second  version,  the  multiple  systems  of  tridiagonal  equations  are  solved 
using  the  standard  TDMA  algorithm  along  the  lines.  To  implement  this  version, 
one  must  remap  the  arrays  from  (:NEWS,:NEWS)  to  (:NEWS,:SERIAL),  for  the 
vertical  lines,  and  to  (:SERIAL,:NEWS)  for  the  horizontal  lines.  This  change  from 
rectangular  subgrids  to  1-d  slices  is  the  most  time-consuming  step,  involving  a  global 
communication  of  data  ("SEND"  instead  of  "NEWS").  Applied  along  the  serial  di- 
mension, the  TDMA  does  not  generate  any  interprocessor  communication.    Some 
front-end-to-processor communication is generated by the incrementing of the DO-loop
index, but unrolling the DO-loop helps to amortize this overhead cost to some
extent.  Thus,  in  Figure  3.2  E  is  approximately  constant  at  0.14,  except  for  very  small 
VP.  The  global  communication  is  much  slower  than  computation  and  consequently 
there  is  not  enough  computation  to  amortize  the  communication.  Furthermore,  the 
constant E implies from Eq. 3.3 that Tcomm and Tcomp both scale in the same way
with problem size. It is evident that Tcomp ~ VP because the TDMA is $O(N)$. Thus
constant E implies Tcomm ~ VP. This means doubling VP doubles Tcomm, indicating
the  communication  speed  has  reached  its  peak,  which  further  indicates  that  the  full 
bandwidth  of  the  fat-tree  is  being  utilized. 

The  disappointing  performance  of  the  standard  line-iterative  approach  using  the 
TDMA  points  out  the  important  fact  that,  for  the  CM-5,  global  communication 
within  inner  iterations  is  intolerable.  There  is  not  enough  computation  to  amortize 
slow  communication  in  the  solver  for  any  problem  size.  With  parallel  cyclic  reduction, 
where  the  regularity  of  the  data  movement  allows  faster  communication,  the  efficiency 
is  much  higher,  although  still  significantly  lower  than  for  point-iterations.  Additional 
improvement  can  be  sought  by  using  the  "on-VU"  data  layout  to  implement  the 
line-iterative  solver  within  each  processor's  subgrid.  This  implementation  essentially 
trades  interprocessor  communication  for  the  front-end-to-PE  type  of  communication, 
and  in  practice  a  front-end  bottleneck  develops.  For  the  remainder  of  the  discussion, 
all  line-Jacobi  results  refer  to  the  parallel  cyclic  reduction  implementation. 

On  the  MP-1,  the  front-end-to-processor  communication  is  not  a  major  concern, 
as  can  be  inferred  from  Figure  3.3.  The  efficiency  of  the  SIMPLE  algorithm  using 
the  point-Jacobi  solver  is  plotted  for  each  machine  for  the  range  of  problem  sizes 
corresponding  to  the  cases  solved  on  the  MP-1.  The  CM-2  and  CM-5  can  solve 
much  larger  problems,  so  for  comparison  purposes  only  part  of  their  data  is  shown. 
Also, because the computers have different numbers of processors, the number of grid
points  is  used  instead  of  VP  to  define  the  problem  size. 

As  in  Figure  3.2,  each  curve  exhibits  an  initial  rise  corresponding  to  the  amortiza- 
tion of  the  front-end-to-processor  communication  and,  for  the  CM-2  and  CM-5,  the 
off-processor  "NEWS"  communication.  On  the  MP-1,  peak  E  is  reached  for  small 
problems (VP > 32). Due to the MP-1's relatively slow processors, the computation
time quickly amortizes the front-end-to-processor communication time as VP
increases.  Furthermore,  because  the  relative  speed  of  X-Net  communication  is  fast, 
the  peak  E  is  high,  0.85.  On  the  CM-2,  the  peak  E  is  0.4,  and  this  efficiency  is 
reached  for  approximately  VP  >  128.  On  the  CM-5,  the  peak  E  is  0.8,  but  this 
efficiency  is  not  reached  until  VP  >  2k.  If  computation  is  fast,  then  the  rate  of  in- 
crease of  E  with  VP  depends  on  the  relative  cost  of  on-processor,  off-processor,  and 
front-end-to-processor  communication.  If  the  on-processor  communication  is  fast, 
larger  VP  is  required  to  reach  peak  E.  Thus,  on  the  CM-5,  the  relatively  fast  on-VU 
communication  is  simultaneously  responsible  for  the  good  (0.8)  peak  E,  and  the  fact 
that very large problem sizes (VP > 2k, 64 times larger than on the MP-1) are
needed  to  reach  this  peak  E. 

The  aspect  ratio  of  the  virtual  subgrid  constitutes  a  secondary  effect  of  the  data 
layout  on  the  efficiency  for  hierarchical  mapping.  The  major  influence  on  E  depends 
on  VP,  i.e.  the  subgrid  size,  but  the  subgrid  shape  matters,  too.  This  dependence 
comes  into  play  due  to  the  different  speeds  of  the  on-processor  and  off-processor  types 
of communication. Higher aspect ratio subgrids have higher perimeter-to-area ratios,
and thus relatively more off-processor communication than square subgrids.

Figure  3.4  gives  some  idea  of  the  relative  importance  of  the  subgrid  aspect  ratio 
effect.  Along  each  curve  the  number  of  grid  points  is  fixed,  but  the  grid  dimensions 
vary,  which,  for  a  given  processor  layout,  causes  the  subgrid  shape  (aspect  ratio),  to 
vary. For example, on the CM-5 with an 8 x 16 processor layout, the following grids
were  used  corresponding  to  the  VP  =  1024  CM-5  curve:  256  x  512,  512  x  256,  680  x 
192,  and  1024  x  128.  These  cases  give  subgrid  aspect  ratios  of  1,  4,  7,  and  16.  Tnews 
is  the  time  spent  in  "NEWS"  type  of  interprocessor  communication  and  Tcomp  is  the 
time  spent  doing  computation  during  100  SIMPLE  iterations.  The  solver  for  these 
results  is  point-Jacobi  relaxation. 
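
The subgrid shapes behind these aspect ratios follow directly from the grid dimensions and the 8 x 16 processor layout. The short Python sketch below reproduces them; the function names are hypothetical, and the ceiling division is an assumption that only matters for grids which do not divide evenly across the processors.

```python
def subgrid_shape(grid, procs):
    """Subgrid dimensions per processor for a block (hierarchical) layout."""
    return tuple(-(-g // p) for g, p in zip(grid, procs))   # ceiling division


def aspect_ratio(shape):
    """Ratio of the longer subgrid side to the shorter one."""
    return max(shape) / min(shape)


procs = (8, 16)
grids = [(256, 512), (512, 256), (680, 192), (1024, 128)]
shapes = [subgrid_shape(g, procs) for g in grids]
ratios = [aspect_ratio(s) for s in shapes]
sizes = [s[0] * s[1] for s in shapes]
```

The shapes come out as 32 x 32, 64 x 16, 85 x 12, and 128 x 8. The 680 x 192 grid gives a subgrid of 1020 elements rather than exactly 1024, with an aspect ratio slightly above 7.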

For  the  VP  =  1024  CM-5  case,  increasing  the  aspect  ratio  from  1  to  16  causes 
Tnews/Tcomp to increase from 0.3 to 0.5. This increase in Tnews/Tcomp increases the
run  time  for  100  iterations  from  15s  to  20s,  and  decreases  the  efficiency  from  0.61  to 
0.54.  For  the  VP  =  8192  CM-5  case,  increasing  the  aspect  ratio  from  1  to  16  causes 
Tnews/Tcomp to increase from 0.19 to 0.27. This increase in Tnews/Tcomp increases the
run  time  for  100  iterations  from  118s  to  126s,  and  decreases  the  efficiency  from  0.74 
to  0.72.  Thus,  the  aspect  ratio  effect  diminishes  as  VP  increases  due  to  the  increasing 
area  of  the  subgrid.  In  other  words  the  variation  in  the  perimeter  length  matters  less, 
percentage-wise,  as  the  area  increases.  The  CM-2  results  are  similar.  However,  on 
the  CM-2  the  on-PE  type  of  communication  is  slower  than  on  the  CM-5,  relative  to 
the computational speed. Thus, Tnews/Tcomp ratios are higher on the CM-2.

3.3.2 Effect of Uniform Boundary Condition Implementation

In  addition  to  the  choice  of  solver,  the  treatment  of  boundary  coefficient  computa- 
tions was  discussed  earlier  as  an  important  consideration  affecting  parallel  efficiency. 
Figure  3.5  compares  the  implementation  described  in  the  introductory  section  of  this 
chapter, to an implementation which treats the boundary control volumes separately
from  the  interior  control  volumes.  The  latter  approach  involves  some  1-d  operations 
which  leave  some  processors  idle. 


The  results  indicated  in  Figure  3.5  were  obtained  on  the  CM-2,  using  point- 
Jacobi  relaxation  as  the  solver.  With  the  uniform  approach,  the  ratio  of  the  time 
spent computing coefficients, Tcoeff, to the time spent solving the equations, Tsolve,
remains constant at 0.6 for VP > 256. Both Tcoeff and Tsolve ~ VP in this case, so
doubling VP doubles both Tcoeff and Tsolve, leaving their ratio unchanged. The value
0.6  reflects  the  relative  cost  of  coefficient  computations  compared  to  point-Jacobi 
iteration.  There  are  three  equations  for  which  coefficients  are  computed  and  15  total 
inner  iterations,  3  each  for  the  u  and  v  equations,  and  9  for  the  p'  equation.  Thus  if 
more  inner  iterations  are  taken,  the  ratio  of  Tcoejj  to  Tsoive  will  decrease,  and  vice- 
versa.  With  the  1-d  implementation,  Tcoeff/Tsohe  increases  until  VP  >  1024.  Both 
Tcoeff  and  Tsoive  scale  with  VP  asymptotically,  but  Figure  3.5  shows  that  Tcoe/j  has  an 
apparently  very  significant  square-root  component  due  to  the  boundary  operations.  If 
N  is  the  number  of  grid  points  and  Up  is  the  number  of  processors,  then  VP  —  Njrij,. 
For  boundary  operations,  N^^'^  control  volumes  are  computed  in  parallel  with  only 
n^/^  processors — hence  the  VP^/^  contribution  to  Tcoejj-  From  Figure  3.5,  it  appears 
that  very  large  problems  are  required  to  reach  the  point  where  the  interior  coefficient 
computations  amortize  the  boundary  coefficient  computations.  Even  for  large  VP 
when  Tcoefj/Tsoive  is  approaching  a  constant,  this  constant  is  larger,  approximately 
0.8  compared  to  0.6  for  the  uniform  approach,  due  to  the  additional  front-end-to- 
processor  communication  which  is  intrinsic  to  the  1-d  formulation. 
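
The origin of the $VP^{1/2}$ term can be checked with a small count. This sketch assumes a square grid of $N$ points mapped onto a square array of $n_p$ processors, with one grid edge handled by one edge of the processor array. The function name and the example sizes are illustrative only.

```python
import math

def boundary_cost_model(n_points, n_procs):
    """Return (VP, boundary control volumes per active processor, idle fraction)."""
    vp = n_points / n_procs
    active = math.sqrt(n_procs)            # processors along one edge of the array
    boundary_per_active = math.sqrt(n_points) / active
    idle_fraction = 1.0 - active / n_procs
    return vp, boundary_per_active, idle_fraction

vp, per_active, idle = boundary_cost_model(2**20, 1024)
print(vp, per_active, idle)
```

For this example each active processor handles $VP^{1/2}$ boundary control volumes while about 97% of the processors are idle, which is why the 1-d boundary operations are amortized only at large VP.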

3.3.3     Overall  Performance 

Table 3.1 summarizes the relative performance of SIMPLE on the CM-2, CM-5,
and MP-1 computers, using point and line-iterative solvers and the uniform
boundary condition treatment. In the first three cases the "NEWS"
implementation of point-Jacobi relaxation is the solver, while the last two
cases are for the line-Jacobi solver using cyclic reduction.


Machine        Solver         Problem Size   VP     T_p     Time/Iter./Pt.   Speed (Mflops)   % of Peak Speed
512 PE CM-2    Point-Jacobi   512 x 1024     1024   188 s   2.6 x 10^-6 s    147              4
128 VU CM-5    Point-Jacobi   736 x 1472     8192   137 s   1.3 x 10^-6 s    417              10
1024 PE MP-1   Point-Jacobi   512 x 512      256    316 s   1.2 x 10^-5 s    44*              59
512 PE CM-2    Line-Jacobi    512 x 1024     1024   409 s   7.8 x 10^-6 s    133              3
128 VU CM-5    Line-Jacobi    736 x 1472     8192   453 s   4.2 x 10^-6 s    247              6

Table 3.1. Performance results for the SIMPLE algorithm, for 100 iterations of
the model problem. The solvers are the point-Jacobi ("NEWS") and line-Jacobi
(cyclic reduction) implementations. 3, 3, and 9 inner iterations are used for
the u, v, and p' equations, respectively.
* The speeds are for double-precision calculations, except on the MP-1.

In Table 3.1, the speeds reported are obtained by comparing the timings with
the identical code timed on a Cray C90, using the Cray hardware performance
monitor to determine Mflops. In terms of Mflops, the performance of the CM-2
version of the SIMPLE algorithm appears to be consistent with other CFD
algorithms on the CM-2. Jesperson and Levit [44] report 117 Mflops for a scalar
implicit version of an approximate factorization Navier-Stokes algorithm using
parallel cyclic reduction to solve the tridiagonal systems of equations. This
result was obtained for a 512 x 512 simulation of 2-d flow over a cylinder using
a 16k CM-2 as in the present study (a different execution model was used; see
[3, 47] for details). The measured time per
time-step  per  grid  point  was  1.6  x  10~^  seconds.  By  comparison,  the  performance  of 
the SIMPLE algorithm for the 512 x 1024 problem size using the line-Jacobi
solver is 133 Mflops and $7.8 \times 10^{-6}$ seconds per iteration per grid
point. Egolf [20] reports that the TEACH Navier-Stokes combustor code, based on
a sequential pressure-based method with a solver that is comparable to
point-Jacobi relaxation, obtains a performance which is 3.67 times better than
a vectorized Cray X-MP version of the code, for a model problem with
$3.2 \times 10^{4}$ nodes. The present program runs 1.6 times faster than a
single Cray C90 processor for a 128 x 256 problem (32k grid points). One Cray
C90 processor is about 2-4 times faster than a Cray X-MP. Thus, the present code
runs comparably fast.

3.3.4     Isoefficiency Plot

Figures 3.2-3.4 addressed the effects of the inner-iterative solver, the
boundary treatment, the data layout, and the variation of parallel efficiency
with problem size for a fixed number of processors. Varying the number of
processors is also of interest and, as discussed in Chapter 1, an even more
practical numerical experiment is to vary $n_p$ in proportion with the problem
size, i.e. the scaled-size model.

Figure 3.6, which is based on the point-Jacobi MP-1 timings, incorporates the
above information into one plot, which has been called an isoefficiency plot by
Kumar and Singh [46]. The lines are paths along which the parallel efficiency E
remains constant as the problem size and the number of processors $n_p$ vary.
Using the point-Jacobi solver and the uniform boundary coefficient
implementation, each SIMPLE iteration has no substantial contribution from
operations which are less than fully parallel or from operations whose time
depends on the number of processors. The efficiency is only a function of the
virtual processor ratio, thus the lines are straight. Much of the parameter
space is covered by efficiencies between 0.6 and 0.8.

The reason that the present implementation is linearly scalable is that the
operations are all scalable: each SIMPLE iteration has predominantly
nearest-neighbor communication and computation and full parallelism. Thus,
$T_p$ depends on VP. Local communication speed does not depend on $n_p$.

$T_1$ depends on the problem size $N$. Thus, as $N$ and $n_p$ are increased in
proportion, starting from some initial ratio, the efficiency from Eq. 3.3 stays
constant. If the initial problem size is large and the corresponding parallel
run time is acceptable, then one can quickly get to very large problem sizes
while still maintaining $T_p$ constant by increasing $n_p$ a relatively small
amount (along the E = 0.85 curve). If the desired run time is smaller, then
initially (i.e. starting from small $n_p$) the efficiency will be lower. Then
the scaled-size experiment requires relatively more processors to get to a
large problem size along the constant efficiency (constant $T_p$ for
point-Jacobi iterations) curve. Thus, the most desirable situation occurs when
the efficiency is high for an initially small problem size.
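
The statement that the efficiency depends only on VP can be illustrated with a toy cost model. The sketch below assumes that Eq. 3.3 has the usual form $E = T_1/(n_p T_p)$, with $T_1$ proportional to $N$. The per-iteration cost terms and their coefficients (`T_COMP`, `T_NEWS`, `T_FRONT_END`) are hypothetical placeholders, not measured values.

```python
import math

# Hypothetical cost coefficients (arbitrary time units), for illustration only.
T_COMP = 1.0        # computation per grid point
T_NEWS = 2.0        # nearest-neighbor communication per subgrid-edge point
T_FRONT_END = 50.0  # fixed front-end-to-processor overhead per iteration

def parallel_time(vp):
    """Model T_p for one iteration on a processor holding `vp` grid points."""
    return T_COMP * vp + T_NEWS * math.sqrt(vp) + T_FRONT_END

def efficiency(n_points, n_procs):
    """E = T_1 / (n_p * T_p), with the serial time T_1 = T_COMP * N."""
    vp = n_points / n_procs
    return (T_COMP * n_points) / (n_procs * parallel_time(vp))

# Scaled-size experiment: N and n_p grow together, so VP and E stay fixed.
for n_points, n_procs in ((2**18, 1024), (2**20, 4096), (2**22, 16384)):
    print(n_points, n_procs, round(efficiency(n_points, n_procs), 3))
```

Because every term of the modeled $T_p$ is a function of VP alone, each line of constant E in a plot of $N$ against $n_p$ is a straight line through the origin.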

For this case the fixed-time and scaled-size methods are equivalent, because
the serial time per iteration, $T_1$, is proportional to the problem size $N$.
However, this is not the case when the SIMPLE inner iterations are done with the
line-Jacobi solver using parallel cyclic reduction. Cyclic reduction requires
$(13 \log_2 N + 1)N$ operations to solve a tridiagonal system of $N$ equations
[44]. Thus, $T_1 \sim (13 \log_2 N + 1)N$ and, on $n_p = N$ processors,
$T_p \sim 13 \log_2 N + 1$ because every processor is active during every step of
the reduction and there are $13 \log_2 N + 1$ steps. Since VP = 1, every
processor's time is proportional to the number of steps, assuming each step
costs about the same.

In the scaled-size approach, one doubles $n_p$ and $N$ together, which therefore
gives $T_1 \sim (26 \log_2 2N + 2)N$ and $T_p \sim 13 \log_2 2N + 1$. The efficiency
is 1, but $T_p$ is increased and $T_1$ is more than doubled. In the fixed-time
approach, then, one concludes that $N$ must be increased by a factor which is
less than two, and $n_p$ must be doubled, in order to maintain constant $T_p$. If
a plot like Figure 3.6 is constructed, it should be done with $T_1$ instead of
$N$ as the measure of problem size. In that case, the lines of constant
efficiency would be described as $T_1 \sim n_p^{\alpha}$, with $\alpha > 1$. The
ideal case is $\alpha = 1$. In addition to the operation count, there is another
factor which reduces the scalability of cyclic reduction, namely that the time
per step is not actually the same as was assumed above: later steps require
communication over longer distances, which is slower. In practice, however, no
more than a few steps are necessary because the coupling between
widely-separated equations becomes very weak. As the system is reduced, the
diagonal becomes much larger than the off-diagonal terms, which can then be
neglected and the reduction process abbreviated.
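
The operation counts quoted above for cyclic reduction can be evaluated directly. The sketch below uses only the two expressions given in the text, $T_1 \sim (13 \log_2 N + 1)N$ and $T_p \sim 13 \log_2 N + 1$ on $n_p = N$ processors, and assumes every step costs the same. The function names are illustrative.

```python
import math

def serial_ops(n):
    """Operation count T_1 ~ (13 log2 N + 1) N for a system of N equations."""
    return (13 * math.log2(n) + 1) * n

def parallel_steps(n):
    """Step count T_p ~ 13 log2 N + 1 with one equation per processor."""
    return 13 * math.log2(n) + 1

n = 1024
for size in (n, 2 * n):
    t1, tp = serial_ops(size), parallel_steps(size)
    # With n_p = N the efficiency T_1 / (n_p * T_p) is 1 by construction.
    print(size, t1, tp, t1 / (size * tp))
```

Doubling $N$ and $n_p$ together leaves the efficiency at 1 while $T_p$ grows and $T_1$ more than doubles, so for this solver the scaled-size and fixed-time experiments are no longer the same thing.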

In short, the basic prerequisite for scaled-size constant efficiency is that
the amount of work per SIMPLE iteration varies with VP and that the overheads
and inefficiencies, specifically the time spent in communication and the
fraction of idle processors, do not grow relative to the useful computational
work as $n_p$ and $N$ are increased proportionally. The SIMPLE implementation
developed here using the point-iterative solvers, Jacobi and red/black, has this
linear computational scalability property.

On the other hand, the number of iterations required by point-iterative
methods increases at a rate greater than the problem size, so although $T_p$ can
be maintained constant while the problem size and $n_p$ are scaled up, the
convergence rate deteriorates. Hence the total run time (cost per iteration
multiplied by the number of iterations) increases. This lack of numerical
scalability of standard iterative methods like point-Jacobi relaxation is the
motivation for the development of multigrid strategies.

3.4     Concluding  Remarks 

The SIMPLE algorithm, especially using point-iterative methods, is efficient
on SIMD machines and can maintain a relatively high efficiency as the problem
size and the number of processors are scaled up. However, boundary coefficient
computations need to be folded in with interior coefficient computations to
achieve good efficiencies at smaller problem sizes. For the CM-5, the
inefficiency caused by idle processors in a 1-d boundary treatment was
significant over the entire range of problem sizes tested. The line-Jacobi
solver based on parallel cyclic reduction leads to a lower peak E (0.5 on the
CM-5) than the point-Jacobi solver (0.8), because there is more communication
and on average this communication is less localized. On the other hand, the
asymptotic convergence rates of the two methods are also different and need to
be considered on a problem-by-problem basis. The speeds which are obtained with
the line-iterative method are consistent and comparable with other CFD
algorithms on SIMD computers.

The key factor in obtaining high parallel efficiency for the SIMPLE algorithm
on the computers used is fast nearest-neighbor communication relative to the
speed of computation. On the CM-2 and CM-5, hierarchical mapping allows
on-processor communication to dominate the slower off-processor form(s) of
communication for large VP. The efficiency is low for small problems because of
the relatively large contribution to the run time from the
front-end-to-processor type of communication, but this type of communication is
constant and becomes less important as the problem size increases.

Once the peak E is reached, the efficiency is determined by the balance of
computation and on-processor communication speeds. For the CM-5, using a
point-Jacobi solver, E approaches approximately 0.8, while on the CM-2 the peak
efficiency is 0.4, which reflects the fact that the CM-5 vector units have a
better balance, at least for the operations in this algorithm, than the CM-2
processors.

The rate at which E approaches the peak value depends on the relative
contributions of on- and off-processor communication and front-end-to-processor
communication to the total run time. On the CM-5, VP > 2k is required to reach
peak E. This problem size is about one-fourth the maximum size which can be
accommodated, and yet still larger than many computations on traditional vector
supercomputers. Clearly a gap is developing between the size of problems which
can be solved efficiently in parallel and the size of problems which are small
enough to be solved on serial computers.

For parallel computations of all but the largest problems, then, the data
layout issue is very important: in going from a square subgrid to one with an
aspect ratio of 16, for a VP = 1k case on the CM-5, the run time increased by
25%. On the MP-1, hierarchical mapping is not needed, because the processors are
slow compared to the X-Net communication speed. The peak E is 0.85 with the
point-Jacobi solver, and this performance is obtained for VP > 32, which is
about one-eighth the size of the largest case possible for this machine. Thus,
with regard to achieving efficient performance in the teraflops range, the
comparison given here suggests a preference for numerous slow processors
instead of fewer fast ones, but such a computer may be difficult and expensive
to build.

[Figure 3.1 panel title: 4 x 1 Layout of Processors]

Figure 3.1. Mapping an 8 element array A onto 4 processors. For the
cut-and-stack mapping, nearest-neighbor array elements are mapped to
nearest-neighbor physical processors. For the hierarchical mapping,
nearest-neighbor array elements are mapped to nearest-neighbor virtual
processors, which may be on the same physical processor.

[Figure 3.2 plot title: Efficiency vs. VP]

Figure 3.2. Parallel efficiency, E, as a function of problem size and solver,
for the CM-5 cases. The number of grid points is the virtual processor ratio,
VP, multiplied by the number of processors, 128. E is computed from Eq. 3.3. It
reflects the relative amount of communication, compared to computation, in the
algorithm.

[Figure 3.3 plot title: E vs. Problem Size]

Figure 3.3. Comparison between the CM-2, CM-5 and MP-1. The variation of
parallel efficiency with problem size is shown for the model problem, using
point-Jacobi relaxation as the solver. E is calculated from Eq. 3.3, and
$T_1 = n_p T_{comp}$ for the CM-2 and CM-5, where $T_{comp}$ is measured. For the
MP-1 cases, $T_1$ is the front-end time, scaled down to the estimated speed of
the MP-1 processors (0.05 Mflops).

Figure 3.4. Effect of subgrid aspect ratio on interprocessor communication
time, $T_{news}$, for the hierarchical data-mapping (CM-2 and CM-5). $T_{news}$ is
normalized by $T_{comp}$ in order to show how the aspect ratio effect varies with
problem size, without the complication of the fact that $T_{comp}$ varies also.

Figure 3.5. Normalized coefficient computation time as a function of problem
size, for two implementations (on the CM-2). In the 1-d case the boundary
coefficients are handled by 1-d array operations. In the 2-d case the uniform
implementation computes both boundary and interior coefficients simultaneously.
$T_{coeff}$ is the time spent computing coefficients in a SIMPLE iteration;
$T_{solve}$ is the time spent in point-Jacobi iterations. There are 15
point-Jacobi iterations ($\nu_u = \nu_v = 3$ and $\nu_c = 9$).

[Figure 3.6 plot title: Isoefficiency Curves]

Figure 3.6. Isoefficiency curves based on the MP-1 cases and SIMPLE method with
the point-Jacobi solver. Efficiency E is computed from Eq. 3.3. Along lines of
constant E the cost per SIMPLE iteration is constant with the point-Jacobi
solver and the uniform boundary condition implementation.


CHAPTER  4 
A  NONLINEAR  PRESSURE-CORRECTION  MULTIGRID  METHOD 

The  single-grid  timing  results  focused  on  the  cost  per  iteration  in  order  to  elucidate 
the  computational  issues  which  influence  the  parallel  run  time  and  the  scalability.  But 
the  parallel  run  time  is  the  cost  per  iteration  multiplied  by  the  number  of  iterations. 
For  scaling  to  large  problem  sizes  and  numbers  of  processors,  the  numerical  method 
must  scale  well  with  respect  to  convergence  rate,  also. 

The convergence rate of the single-grid pressure-correction method deteriorates
with increasing problem size. This trait is inherited from the smoothing
property of the stationary linear iterative method, point or line-Jacobi
relaxation, used to solve the systems of u, v, and p' equations during the
course of SIMPLE iterations. Point-Jacobi relaxation requires $O(N^2)$
iterations, where $N$ is the number of grid points, to decrease the solution
error by a specified amount [1]. In other words, the number of iterations
increases faster than the problem size.
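
The growth in iteration count can be checked on the simplest model problem. The sketch below assumes the 1-d Poisson equation with Dirichlet boundary conditions on a grid of $N$ intervals, for which the point-Jacobi iteration matrix has spectral radius $\cos(\pi/N)$. It is not the SIMPLE iteration itself, and the function name is illustrative.

```python
import math

def jacobi_iterations(n_intervals, reduction=1e-6):
    """Iterations needed to damp the slowest error mode by `reduction`."""
    rho = math.cos(math.pi / n_intervals)   # spectral radius for 1-d Poisson
    return math.ceil(math.log(reduction) / math.log(rho))

for n in (32, 64, 128, 256):
    print(n, jacobi_iterations(n))
```

Each doubling of $N$ multiplies the iteration count by about four in this model, which is the $O(N^2)$ behavior that motivates the multigrid approach.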

At best the cost per iteration stays constant as the number of processors $n_p$
increases in proportion to the problem size. Thus, the total run time increases
in the scaled-size experiment using single-grid pressure-correction methods,
due to the increased number of iterations required. This lack of numerical
scalability is a serious disadvantage for parallel implementations, since the
target problem size for parallel computation is very large.

Multigrid methods can maintain good convergence rates as the problem size
increases. For Poisson equations, problem-size independent convergence rates
can be obtained [36, 55]. The recent book by Briggs [10] introduces the major
concepts in the context of Poisson equations. See also [11, 37, 90] for surveys
and analyses of multigrid convergence properties for more general linear
equations. For a description of practical techniques and special considerations
for fluid dynamics, see the important early papers by Brandt [5, 6]. However,
there are many unresolved issues for application to the incompressible
Navier-Stokes equations, especially with regard to their implementation and
performance on parallel computers. The purpose of this chapter is to describe
the relevant convergence rate and stability issues for multigrid methods in the
context of application to the incompressible Navier-Stokes equations, with
numerical experiments used to illustrate the points made, in particular
regarding the role of the restriction and prolongation procedures.

4.1     Background 

The basic concept is the use of coarse grids to accelerate the asymptotic
convergence rate of an inner iterative scheme. The inner iterative method is
called the "smoother" for reasons to be made clear shortly. In the context of
the present application to the incompressible Navier-Stokes equations, the
single-grid pressure-correction method is the inner iterative scheme. Because
the pressure-correction algorithm also uses inner iterations (to solve the
systems of u, v, and p' equations), the multigrid method developed here
actually has three nested levels of iterations.

A  multigrid  V  cycle  begins  with  a  certain  number  of  smoothing  iterations  on  the 
fine  grid,  where  the  solution  is  desired.  Figure  4.1  shows  a  schematic  of  a  V(3,2)  cycle. 
In  this  case  three  pressure-correction  iterations  are  done  first.  Then  residuals  and 
variables  are  restricted  (averaged)  to  obtain  coarse-grid  values  for  these  quantities. 
The  solution  to  the  coarse-grid  discretized  equation  provides  a  correction  to  the  fine- 
grid  solution.  Once  the  solution  on  the  coarse  grid  is  obtained,  the  correction  is 
interpolated  (prolongated)  to  the  fine  grid  and  added  back  into  the  solution  there. 
Some post-smoothing iterations, two in this case, are needed to eliminate errors
introduced  by  the  interpolation.  Since  it  is  usually  too  costly  to  attempt  a  direct 
solution  on  the  coarse  grid,  this  smoothing-correction  cycle  is  applied  recursively, 
leading  to  the  V  cycle  shown. 

The  next  section  describes  how  such  a  procedure  can  accelerate  the  convergence 
rate  of  an  iterative  method,  in  the  context  of  linear  equations.  The  multigrid  scheme 
for  nonlinear  scalar  equations  and  the  Navier-Stokes  system  of  equations  is  then 
described.  Brandt  [5]  was  the  first  to  formalize  the  manner  in  which  coarse  grids 
could  be  used  as  a  convergence-acceleration  technique  for  a  given  smoother.  The 
idea  of  using  coarse  grids  to  generate  initial  guesses  for  fine-grid  solutions  was  around 
much  earlier. 

The  cost  of  the  multigrid  algorithm,  per  cycle,  is  dominated  by  the  smoothing  cost, 
as  will  be  shown  in  Chapter  5.  Thus,  with  regard  to  the  parallel  run  time  per  multigrid 
iteration,  the  smoother  is  the  primary  concern.  Also,  with  regard  to  the  convergence 
rate,  the  smoother  is  important.  The  single-grid  convergence  rate  characteristics 
of  pressure-correction  methods,  the  dependence  on  Reynolds  number,  flow  problem, 
and  the  convection  scheme,  carry  over  to  the  multigrid  context.  However,  in  the 
multigrid  method  the  smoother's  role  is,  as  the  name  implies,  to  smooth  the  fine-grid 
residual,  which  is  a  different  objective  than  to  solve  the  equations  quickly.  A  smooth 
fine-grid  residual  equation  can  be  approximated  accurately  on  a  coarser  grid.  The 
next  section  describes  an  alternate  pressure-based  smoother,  and  compares  its  cost 
against  the  pressure-correction  method  on  the  CM-5. 

Stability of multigrid iterations is also an important unresolved issue. There
are two ways in which multigrid iterations can be caused to diverge. First, the
single-grid smoothing iterations can diverge; for example, if
central-differencing is used there may be stability problems when the Reynolds
number is high. Second, poor coarse-grid corrections can cause divergence if the
smoothing is insufficient. In a sense this latter issue, the scheme and
intergrid transfer operators which prescribe the coordination between coarse
and fine grids in the multigrid procedure, is the key issue. In the next section
two "stabilization strategies" are described. Then, the impact of different
restriction and prolongation procedures on the convergence rate is studied in
the context of two model problems, lid-driven cavity flow and flow past a
symmetric backward-facing step. These two particular flow problems have
different physical characteristics, and therefore the numerical experiments
should give insight into the problem-dependence of the results.

4.1.1     Terminology and Scheme for Linear Equations

The discrete problem to be solved can be written $A^h u^h = S^h$, corresponding
to some differential equation $L[u] = S$. The set of values $u^h$ is defined by

$$ \{u_{ij}\} = u(ih, jh), \quad (i, j) \in ([0:N], [0:N]) \equiv \Omega^h. \tag{4.1} $$

Similarly, $u^{2h}$ is defined on the coarser grid $\Omega^{2h}$ with grid spacing
$2h$. The variable $u$ can be a scalar or a vector, and the operator $A$ can be
linear or nonlinear.

For  linear  equations,  the  "correction  scheme"  (CS)  is  frequently  used.  A  two- 
level  multigrid  cycle  using  CS  accelerates  the  convergence  of  an  iterative  method 
(with  iteration  matrix  P)  by  the  following  procedure: 

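Do $\nu$ fine-grid iterations:        $v^h \leftarrow P^{\nu} v^h$
Compute residual on $\Omega^h$:       $r^h = S^h - A^h v^h$
Restrict $r^h$ to $\Omega^{2h}$:      $r^{2h} = I_h^{2h} r^h$
Solve exactly for $e^{2h}$:           $e^{2h} = (A^{2h})^{-1} r^{2h}$
Correct $v^h$ on $\Omega^h$:          $(v^h)^{new} = (v^h)^{old} + I_{2h}^h e^{2h}$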

$I_h^{2h}$ and $I_{2h}^h$ symbolize the restriction and prolongation procedures.
The quantity $v^h$ is the current approximation to the discrete solution $u^h$.
The algebraic error is the difference between them, $e^h = u^h - v^h$. The
discretization error is the difference between the exact solutions of the
continuous and discrete problems, $e_{discr} = u - u^h$. The truncation error is
obtained by substituting the exact solution into the discrete equation,

$$ \tau^h = A^h u - S^h = A^h u - A^h u^h. \tag{4.2} $$

The notation above follows Briggs [10].

The two-level multigrid cycle begins on the fine grid with $\nu$ iterations of
the smoother. Standard iterative methods all have the "smoothing property,"
which is that the various eigenvector-decomposed components of the solution
error are damped at rates determined by their corresponding eigenvalues of the
iteration matrix, i.e. the high-frequency errors are damped faster than the
low-frequency (smooth) errors. Thus, the convergence rate of the smoothing
iterations is initially rapid, but deteriorates as smooth error components,
those with large eigenvalues (close to one in magnitude), dominate the
remaining error. The purpose of transferring the problem to a coarser grid is
to make these smooth error components appear more oscillatory with respect to
the grid spacing, so that the initial rapid convergence rate is obtained for the
elimination of these smooth errors by coarse-grid iterations. Since the coarse
grid $\Omega^{2h}$ has only 1/4 as many grid points as $\Omega^h$ (in 2-d), the
smoothing iterations on the coarse grid are cheaper as well as more effective
in reducing the smooth error components than on the fine grid.
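
The smoothing property can be made concrete for one standard smoother. The sketch below assumes weighted Jacobi with $\omega = 2/3$ applied to the 1-d Poisson equation on $N$ intervals, for which the error mode with wavenumber $k$ is multiplied by $1 - 2\omega \sin^2(k\pi/2N)$ on every sweep. This model smoother is chosen for illustration only and is not the pressure-correction smoother used in this work.

```python
import math

def damping_factor(k, n_intervals, omega=2.0 / 3.0):
    """Per-sweep multiplier of error mode k for weighted Jacobi, 1-d Poisson."""
    return 1.0 - 2.0 * omega * math.sin(0.5 * k * math.pi / n_intervals) ** 2

n = 64
smooth_mode = abs(damping_factor(1, n))
rough_modes = max(abs(damping_factor(k, n)) for k in range(n // 2, n))
# The same wavenumber seen from the fine grid and from the grid with spacing 2h.
on_fine = abs(damping_factor(n // 4, n))
on_coarse = abs(damping_factor(n // 4, n // 2))
print(smooth_mode, rough_modes, on_fine, on_coarse)
```

The oscillatory half of the spectrum is reduced by at least a factor of three per sweep, while the smoothest mode is almost untouched. A mode that is only weakly damped on the fine grid is damped much more strongly once it is represented on the coarse grid.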

In the correction scheme, the coarse-grid problem is an equation for the
algebraic error,

$$ A^{2h} e^{2h} = r^{2h}, \tag{4.3} $$

approximating the fine-grid residual equation for the algebraic error. To obtain
the coarse-grid source term, $r^{2h}$, the restriction procedure $I_h^{2h}$ is
applied to the fine-grid residual $r^h$,

$$ r^{2h} = I_h^{2h} r^h. \tag{4.4} $$

Eq. 4.4 is an averaging type of operation. Two common restriction procedures are
straight injection of fine-grid values to their corresponding coarse-grid
points, and averaging $r^h$ over a few fine-grid points which are near the
corresponding coarse-grid point. The initial error on the coarse grid is taken
as zero.

After the solution for $e^{2h}$ is obtained, this coarse-grid quantity is
interpolated to the fine grid and used to correct the fine-grid solution,

$$ v^h \leftarrow v^h + I_{2h}^h e^{2h}. \tag{4.5} $$

For $I_{2h}^h$, common choices are bilinear or biquadratic interpolation.
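
The transfer operators described above have simple 1-d analogues. The sketch below assumes a 1-d grid whose even-numbered points coincide with the coarse-grid points, and uses the weights 1/4, 1/2, 1/4 as one common choice of averaging. The function names are illustrative, and linear interpolation stands in for the bilinear interpolation used in 2-d.

```python
def inject(fine):
    """Restriction by straight injection: keep the coincident fine-grid values."""
    return fine[::2]

def full_weighting(fine):
    """Restriction by averaging each coarse point with its two fine neighbors."""
    coarse = fine[::2]
    for j in range(1, len(coarse) - 1):
        coarse[j] = 0.25 * fine[2 * j - 1] + 0.5 * fine[2 * j] + 0.25 * fine[2 * j + 1]
    return coarse

def interpolate(coarse):
    """Prolongation by linear interpolation to the grid with half the spacing."""
    fine = [0.0] * (2 * (len(coarse) - 1) + 1)
    fine[::2] = coarse
    for j in range(len(coarse) - 1):
        fine[2 * j + 1] = 0.5 * (coarse[j] + coarse[j + 1])
    return fine

oscillatory = [0.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, 0.0]
print(inject(oscillatory))
print(full_weighting(oscillatory))
print(interpolate([0.0, 2.0, 4.0]))
```

For the oscillatory residual in this example, injection transfers a spurious smooth signal to the coarse grid, whereas the averaging filters it out. This is one reason the choice of restriction interacts with how well the fine-grid residual has been smoothed.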

In practice the solution for $e^{2h}$ is obtained by recursion on the two-level
cycle; $(A^{2h})^{-1}$ is not explicitly computed. On the coarsest grid, direct
solution may be feasible if the equation is simple enough. Otherwise a few
smoothing iterations can be applied.
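
The recursion can be sketched for a linear model problem. The example below is a minimal correction-scheme V(3,2) cycle for the 1-d Poisson equation $-u'' = f$ with zero boundary values. It assumes weighted Jacobi as the smoother, full-weighting restriction, linear interpolation, a coarse-grid operator obtained by rediscretization, and the sign convention $r = f - Av$ so that $Ae = r$ holds with $e = u - v$. It is not the pressure-correction multigrid method developed in this chapter, and all function names are illustrative.

```python
import math

def smooth(v, f, h, sweeps, omega=2.0 / 3.0):
    """Weighted-Jacobi sweeps for -u'' = f with fixed end values."""
    for _ in range(sweeps):
        old = v[:]
        for i in range(1, len(v) - 1):
            jacobi = 0.5 * (old[i - 1] + old[i + 1] + h * h * f[i])
            v[i] = (1.0 - omega) * old[i] + omega * jacobi
    return v

def residual(v, f, h):
    """r = f - A v, with A the 3-point discretization of -d2/dx2."""
    r = [0.0] * len(v)
    for i in range(1, len(v) - 1):
        r[i] = f[i] - (2.0 * v[i] - v[i - 1] - v[i + 1]) / (h * h)
    return r

def restrict(r):
    """Full weighting to the grid with spacing 2h (boundary values stay zero)."""
    nc = (len(r) - 1) // 2
    rc = [0.0] * (nc + 1)
    for j in range(1, nc):
        rc[j] = 0.25 * r[2 * j - 1] + 0.5 * r[2 * j] + 0.25 * r[2 * j + 1]
    return rc

def prolong(ec):
    """Linear interpolation to the grid with spacing h."""
    e = [0.0] * (2 * (len(ec) - 1) + 1)
    for j in range(len(ec) - 1):
        e[2 * j] = ec[j]
        e[2 * j + 1] = 0.5 * (ec[j] + ec[j + 1])
    e[-1] = ec[-1]
    return e

def v_cycle(v, f, h, pre=3, post=2):
    """One V(pre, post) cycle of the correction scheme, applied recursively."""
    if len(v) == 3:  # coarsest grid: a single unknown, solved directly
        v[1] = 0.5 * (v[0] + v[2] + h * h * f[1])
        return v
    smooth(v, f, h, pre)
    rc = restrict(residual(v, f, h))
    ec = v_cycle([0.0] * len(rc), rc, 2.0 * h, pre, post)  # zero initial error
    e = prolong(ec)
    for i in range(1, len(v) - 1):
        v[i] += e[i]
    return smooth(v, f, h, post)

n = 64
h = 1.0 / n
f = [math.pi ** 2 * math.sin(math.pi * i * h) for i in range(n + 1)]
v = [0.0] * (n + 1)
norms = []
for cycle in range(5):
    v = v_cycle(v, f, h)
    norms.append(max(abs(ri) for ri in residual(v, f, h)))
    print(cycle + 1, norms[-1])
```

The printed residual norms show the reduction obtained per cycle for this model problem. The coarsest grid here has a single unknown, so the direct solution mentioned above is trivial.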

Recursion on the two-level algorithm leads to a "V cycle," as shown in
Figure 4.1. A simple V(3,2) cycle is shown. Three smoothing iterations are taken
before restricting to the next coarser grid, and two iterations are taken after
the solution has been corrected. The purpose of the latter smoothing iterations
is to smooth out any high-frequency noise introduced by the prolongation. Other
cycles can be envisioned. In particular the W cycle is popular [6]. The cycling
strategy is called the "grid-schedule," since it is the order in which the
various grid levels are visited.

The most important consideration for the correction scheme has been saved for
last, namely the definition of the coarse-grid discrete operator $A^{2h}$. One
possibility is to discretize the original differential equation directly on the
coarse grid. However, this choice is not always the best one. The
convergence-rate benefit from the multigrid strategy is derived from the
particular coarse-grid approximation to the fine-grid discrete problem, not the
continuous problem. Because the coarse-grid solutions and residuals are obtained
by particular averaging procedures, there is an implied averaging procedure for
the fine-grid discrete operator $A^h$ which should be honored to ensure a useful
homogenization of the fine-grid residual equation. This issue is critical when
the coefficients and/or dependent variables of the governing equations are not
smooth [17].

For  the  Poisson  equation,  the  Galerkin  approximation  A'^^  =  Il^A^l2h,  is  the 
right  choice.  The  discretized  equation  coefficients  on  the  coarse  grid  are  obtained 
by  applying  suitable  averaging  and  interpolation  operations  to  the  fine-grid  coeffi- 
cients, instead  of  by  discretizing  the  governing  equation  on  a  grid  with  a  coarser 
mesh  spacing.  Briggs  has  shown,  by  exploiting  the  algebraic  relationship  between 
bilinear  interpolation  and  full-weighting  restriction  operators,  that  initially  smooth 
errors  begin  in  the  range  of  interpolation  and  finish,  after  the  smoothing-correction 
cycle  is  applied,  in  the  null  space  of  the  restriction  operator  [10].  Thus,  if  the  fine-grid 
smoothing  eliminates  all  the  high-frequency  error  components  in  the  solution,  one  V 
cycle  using  the  correction-scheme  is  a  direct  solver  for  the  Poisson  equation.  The  con- 
vergence rate  of  multigrid  methods  using  the  Galerkin  approximation  is  more  difficult 
to  analyze  if  the  governing  equations  are  more  complicated  than  Poisson  equations, 
but  significant  theoretical  advantages  for  application  to  general  linear  problems  have 
been  indicated  [90]. 
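
A small numerical check can make the Galerkin construction concrete. The sketch below assumes a 1-D Poisson problem with homogeneous Dirichlet ends, linear interpolation for prolongation, and full-weighting restriction; the function names are illustrative and not taken from any code described in this work. For this particular model problem the Galerkin product coincides with the operator obtained by rediscretizing on the grid of spacing 2h, which is not true in general for variable coefficients.

```python
import numpy as np

def laplacian_1d(n, h):
    """3-point discretization of -d2/dx2 on n interior points, Dirichlet ends."""
    return (2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)) / h**2

def prolongation_1d(nc):
    """Linear interpolation from nc coarse interior points to 2*nc + 1 fine ones."""
    P = np.zeros((2 * nc + 1, nc))
    for j in range(nc):
        i = 2 * j + 1            # fine-grid index that coincides with coarse point j
        P[i, j] = 1.0
        P[i - 1, j] = 0.5
        P[i + 1, j] = 0.5
    return P

nc = 7                           # coarse interior points (assumed size)
nf = 2 * nc + 1                  # fine interior points
h = 1.0 / (nf + 1)

P = prolongation_1d(nc)          # I_{2h}^{h}
R = 0.5 * P.T                    # I_{h}^{2h}: full weighting
A_h = laplacian_1d(nf, h)

A_2h_galerkin = R @ A_h @ P              # Galerkin coarse operator
A_2h_direct = laplacian_1d(nc, 2 * h)    # rediscretization with spacing 2h

print(np.max(np.abs(A_2h_galerkin - A_2h_direct)))
```

The printed difference should be at round-off level for this model problem.
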

4.1.2     Full-Approximation Storage Scheme for Nonlinear Equations

The  brief  description  given  above  does  not  bring  out  the  complexities  inherent  in 
the  application  to  nonlinear  problems.  There  is  only  experience,  derived  mostly  from 
numerical  experiments,  to  guide  the  choice  of  the  restriction/prolongation  procedures 
and  the  smoother.  Furthermore,  the  linkage  between  the  grid  levels  requires  special 
considerations  because  of  the  nonlinearity. 

The  correction  scheme  using  the  Galerkin  approximation  can  be  applied  to  the 
nonlinear  Navier-Stokes  system  of  equations  [94].  However,  in  order  to  use  CS  for 
nonlinear  equations,  linearization  is  required.  The  best  coarse-grid  correction  only 
improves the fine-grid solution to the linearized equation. Also, for complex equations,
considerable expense is incurred in computing $A^{2h}$ by the Galerkin approximation.
The commonly adopted alternative is the intuitive one, to let $A^{2h}$ be the
differential  operator  L  discretized  on  the  grid  with  spacing  2h  instead  of  h.  In  ex- 
change for  a  straightforward  problem  definition  on  the  coarse  grid  though,  special 
restriction  and  prolongation  procedures  may  be  necessary  to  ensure  the  usefulness  of 
the  resulting  corrections.  Numerical  experiments  on  a  problem-by-problem  basis  are 
necessary  to  determine  good  choices  for  the  restriction  and  prolongation  procedures 
for  Navier-Stokes  multigrid  methods. 

The  full-approximation  storage  (FAS)  scheme  [5]  is  preferred  over  the  correction 
scheme  for  nonlinear  problems.  The  coarse-grid  corrections  generated  by  FAS  improve 
the  solution  to  the  full  nonlinear  problem  instead  of  just  the  linearized  one.  The 
discretized  equation  on  the  fine  grid  is,  again, 

$$A^h u^h = S^h. \qquad (4.6)$$

The approximate solution $v^h$ after a few fine-grid iterations defines the residual on
the fine grid,

$$A^h v^h = S^h + r^h. \qquad (4.7)$$

A correction, the algebraic error $e^h_{alg} = u^h - v^h$, is sought which satisfies

$$A^h(v^h + e^h_{alg}) = S^h. \qquad (4.8)$$

The residual equation is formed by subtracting Eq. 4.7 from Eq. 4.8, and cancelling $S^h$,

$$A^h(v^h + e^h) - A^h(v^h) = -r^h, \qquad (4.9)$$

where the subscript "alg" is dropped for convenience. For linear equations the $A^h v^h$
terms cancel leaving Eq. 4.3. Eq. 4.9 does not simplify for nonlinear equations.
Assuming that the smoother has done its job, $r^h$ is smooth and Eq. 4.9 is the same
as the coarse-grid residual equation

$$A^{2h}(v^{2h} + e^{2h}) - A^{2h}(v^{2h}) = -r^{2h} \qquad (4.10)$$

at coarse-grid points.

The error $e^{2h}$ is to be found, interpolated back to $\Omega^h$ according to $e^h = I_{2h}^h e^{2h}$,
and added to $v^h$ so that Eq. 4.8 is satisfied. The known quantities are $v^{2h}$, which is a
"suitable" restriction of $v^h$, and $r^{2h}$, likewise a restriction of $r^h$. Different restrictions
can be used for residuals and solutions. Thus, Eq. 4.10 can be written

$$A^{2h}(I_h^{2h} v^h + e^{2h}) = A^{2h}(I_h^{2h} v^h) - I_h^{2h} r^h. \qquad (4.11)$$

Since Eq. 4.11 is not an equation for $e^{2h}$ alone, one solves instead for the sum
$u^{2h} = I_h^{2h} v^h + e^{2h}$. Expanding $r^h$ and regrouping terms, Eq. 4.11 can be written

$$A^{2h}(u^{2h}) = A^{2h}(I_h^{2h} v^h) - I_h^{2h} r^h \qquad (4.12)$$

$$\phantom{A^{2h}(u^{2h})} = \left[ A^{2h}(I_h^{2h} v^h) - I_h^{2h}(A^h v^h) + I_h^{2h} S^h - S^{2h} \right] + S^{2h} \qquad (4.13)$$

$$\phantom{A^{2h}(u^{2h})} = S^{2h}_{numerical} + S^{2h}. \qquad (4.14)$$


Eq. 4.14 is similar to Eq. 4.6 except for the extra numerically-derived source term.
Once $I_h^{2h} v^h + e^{2h}$ is obtained, the coarse-grid approximation to the fine-grid error, $e^{2h}$,
is computed by first subtracting the initial coarse-grid solution $I_h^{2h} v^h$,

$$e^{2h} = u^{2h} - I_h^{2h} v^h, \qquad (4.15)$$

then interpolating back to the fine grid and combining with the current solution,

$$v^h \leftarrow v^h + I_{2h}^h(e^{2h}). \qquad (4.16)$$
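
The FAS steps of Eqs. 4.12, 4.15 and 4.16 can be sketched for a scalar model problem. The example below is only an illustration under stated assumptions: the nonlinear operator A(u) = -u'' + u^3 in one dimension, a nonlinear Gauss-Seidel smoother with one pointwise Newton update per node, injection for the solution, full weighting for the residual, and a coarse problem "solved" by many smoothing sweeps. None of these choices or function names come from this work; they stand in for the Navier-Stokes smoothers and transfer operators discussed here.

```python
import numpy as np

def apply_A(u, h):
    """Nonlinear operator A(u) = -u'' + u**3 with zero Dirichlet end values."""
    up = np.concatenate(([0.0], u, [0.0]))
    return (2.0 * up[1:-1] - up[:-2] - up[2:]) / h**2 + u**3

def smooth(u, S, h, sweeps):
    """Nonlinear Gauss-Seidel: one pointwise Newton update per node per sweep."""
    u = u.copy()
    n = len(u)
    for _ in range(sweeps):
        for i in range(n):
            left = u[i - 1] if i > 0 else 0.0
            right = u[i + 1] if i < n - 1 else 0.0
            F = (2.0 * u[i] - left - right) / h**2 + u[i]**3 - S[i]
            u[i] -= F / (2.0 / h**2 + 3.0 * u[i]**2)
    return u

def restrict_residual(r):
    """Full weighting, used here for residuals."""
    return 0.25 * r[0:-2:2] + 0.5 * r[1:-1:2] + 0.25 * r[2::2]

def restrict_solution(v):
    """Injection, used here for the solution."""
    return v[1::2].copy()

def prolong(ec):
    """Linear interpolation from the coarse grid to the fine grid."""
    ef = np.zeros(2 * len(ec) + 1)
    ef[1::2] = ec
    ep = np.concatenate(([0.0], ec, [0.0]))
    ef[0::2] = 0.5 * (ep[:-1] + ep[1:])
    return ef

def fas_two_grid(v, S, h, pre=3, post=2, coarse_sweeps=500):
    """One two-grid FAS cycle with `pre` and `post` smoothing sweeps."""
    v = smooth(v, S, h, pre)
    r = apply_A(v, h) - S                            # A^h(v^h) = S^h + r^h
    v2 = restrict_solution(v)                        # initial coarse solution
    S2 = apply_A(v2, 2 * h) - restrict_residual(r)   # right-hand side of Eq. 4.12
    u2 = smooth(v2, S2, 2 * h, coarse_sweeps)        # approximate coarse solve
    e2 = u2 - v2                                     # Eq. 4.15
    v = v + prolong(e2)                              # Eq. 4.16
    return smooth(v, S, h, post)

nf = 31                                              # fine interior points (assumed)
h = 1.0 / (nf + 1)
x = h * np.arange(1, nf + 1)
S = np.pi**2 * np.sin(np.pi * x) + np.sin(np.pi * x)**3
v = np.zeros(nf)
res0 = np.linalg.norm(apply_A(v, h) - S)
for _ in range(10):
    v = fas_two_grid(v, S, h)
res = np.linalg.norm(apply_A(v, h) - S)
print(res / res0)
```

Note that the coarse grid is asked for the full variable u2 rather than for the error alone, which is what allows the nonlinear operator to be evaluated without linearization.
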


4.1.3     Extension  to  the  Navier-Stokes  Equations 

The  incompressible  Navier-Stokes  equations  are  a  system  of  coupled,  nonlinear 
equations.  Consequently  the  FAS  scheme  given  above  for  single  nonlinear  equations 
needs  to  be  modified. 

The variables $u_1$, $u_2$, and $u_3$ represent the cartesian velocity components and the
pressure, respectively. Corresponding subscripts are used to identify each equation's
source term, residual and discrete operator in the formulation below. The three
equations for momentum and mass conservation are treated as if part of the following
matrix equation,

$$\begin{bmatrix} A_1^h & 0 & G_x^h \\ 0 & A_2^h & G_y^h \\ G_x^h & G_y^h & 0 \end{bmatrix}
\begin{bmatrix} u_1^h \\ u_2^h \\ u_3^h \end{bmatrix} =
\begin{bmatrix} S_1^h \\ S_2^h \\ S_3^h \end{bmatrix}. \qquad (4.17)$$

The continuity equation source term is zero on the finest grid, $\Omega^h$, but for coarser grid
levels it may not be zero. Thus, for the sake of generality it is included in Eq. 4.17.
For the $u_1$-momentum equation, Eq. 4.8 is modified to account for the
pressure-gradient, $G_x^h u_3^h$, which is also an unknown. The approximate solutions are
$v_1^h$, $v_2^h$, and $v_3^h$ corresponding to $u_1^h$, $u_2^h$, and $u_3^h$. For the $u_1$-momentum equation, the
approximate solution satisfies

$$A_1^h(v_1^h) + G_x^h(v_3^h) = S_1^h + r_1^h. \qquad (4.18)$$

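The fine-grid residual equation corresponding to Eq. 4.9 is modified to

$$A_1^h(v_1^h + e_1^h) - A_1^h(v_1^h) + G_x^h(v_3^h + e_3^h) - G_x^h(v_3^h) = -r_1^h, \qquad (4.19)$$

which is approximated on the coarse grid by the corresponding coarse-grid residual
equation,

$$A_1^{2h}(v_1^{2h} + e_1^{2h}) - A_1^{2h}(v_1^{2h}) + G_x^{2h}(v_3^{2h} + e_3^{2h}) - G_x^{2h}(v_3^{2h}) = -r_1^{2h}. \qquad (4.20)$$

The known terms are $v_1^{2h} = I_h^{2h} v_1^h$, $v_3^{2h} = I_h^{2h} v_3^h$, and $r_1^{2h} = I_h^{2h} r_1^h$.
Expanding $r_1^h$ and regrouping terms, Eq. 4.20 can be written

$$\begin{aligned}
A_1^{2h}(u_1^{2h}) + G_x^{2h}(u_3^{2h}) &= A_1^{2h}(I_h^{2h} v_1^h) + G_x^{2h}(I_h^{2h} v_3^h) - I_h^{2h}(A_1^h v_1^h + G_x^h v_3^h) + I_h^{2h} S_1^h \\
&= \left[ A_1^{2h}(I_h^{2h} v_1^h) + G_x^{2h}(I_h^{2h} v_3^h) - I_h^{2h}(A_1^h v_1^h + G_x^h v_3^h) + I_h^{2h} S_1^h - S_1^{2h} \right] + S_1^{2h} \\
&= S^{2h}_{1,numerical} + S_1^{2h}.
\end{aligned} \qquad (4.21)$$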

Since Eq. 4.21 includes numerically derived source terms in addition to the physical
ones,  the  coarse-grid  variables  are  not  in  general  the  same  as  would  be  obtained  from 
a  discretization  of  the  original  continuous  governing  equations  on  the  coarse  grid. 

The $u_2$-momentum equation is treated similarly, and the coarse-grid continuity
equation is

$$G_x^{2h} u_1^{2h} + G_y^{2h} u_2^{2h} = G_x^{2h}(I_h^{2h} v_1^h) + G_y^{2h}(I_h^{2h} v_2^h) - I_h^{2h} r_3^h. \qquad (4.22)$$


The system of equations Eq. 4.17 is solved by either the pressure-correction
method  (sequential)  or  the  locally-coupled  explicit  method  described  in  the  next 
section. 

In  addition  to  the  choice  of  the  smoother,  the  specification  of  the  coarse-grid 
discrete problem ($A^{2h}$) is critical to the convergence rate, and to the stability of
the  multigrid  iterations  as  well.  In  the  description  of  the  FAS  scheme  for  the  2-d 
incompressible  Navier-Stokes  equations  presented  earlier,  no  mention  was  made  of 
the  coarse  grid  discretization.  Intuitively,  one  would  use  the  same  discretization  for 
each  of  the  terms  as  on  the  fine  grid.  For  example,  if  the  convection  terms  are  central- 
differenced  on  the  fine  grid,  then  central-differencing  should  be  used  on  the  coarse 
grid,  also.  However,  with  such  an  approach  numerical  stability  frequently  becomes  a 
problem,  particularly  in  high  Reynolds  number  flow  problems. 

4.2     Comparison  of  Pressure-Based  Smoothers 

The  single-grid  convergence  rate  of  pressure-correction  methods  for  the  incom- 
pressible Navier-Stokes  equations  depends  strongly  on  the  discretization  of  the  non- 
linear convection  terms,  the  Reynolds  number,  and  the  importance  of  the  pressure- 
velocity  coupling  in  the  fluid  dynamics.  The  grid  size  and  quality  can  also  affect  the 
convergence  rate  in  curvilinear  formulations.  These  issues  carry  over  to  the  multigrid 
context  and  are  complicated  by  the  interplay  between  the  evolving  solutions  on  the 
multiple  grid  levels. 

Two pressure-based methods are popular smoothers. The first is the pressure-correction
method studied in Chapters 2 and 3, and the other is Vanka's locally-coupled
explicit method [89] briefly introduced in Chapter 1. Much attention has
been focused on comparing the performance of these two methods in the multigrid
context, i.e. as smoothers. The semi-implicit pressure-correction methods, due to
their implicitness, are better single-grid solvers.

In  the  locally-coupled  explicit  method,  pressure  and  velocity  are  updated  in  a 
coupled  manner  instead  of  sequentially.  A  finite-volume  implementation  on  a  stag- 
gered grid  is  employed.  The  pressure  and  the  velocities  on  the  faces  of  each  p  control 
volume  are  updated  simultaneously. 

However  the  simultaneous  update  of  pressure  and  velocity  is  only  for  one  control 
volume  at  a  time.  Underrelaxation  is  again  necessary  due  to  the  decoupling  between 
control  volumes.  The  control  volumes  are  traversed  by  the  lexicographical  ordering 
with  the  most  recently  updated  u  and  v  values  used  when  available.  Thus  the  original 
method  is  called  BGS  (for  "block  Gauss-Seidel").  After  one  sweep  of  the  grid  each 
u and v have been updated twice and each pressure once. A red-black ordering
suitable  for  parallel  computation  has  been  developed  in  this  research.  By  analogy, 
this  algorithm  is  called  BRB  (block  red-black). 

For  the  (i,j)th  pressure  control  volume,  the  continuity  equation  is  written  in  terms 
of  the  velocity  corrections  needed  to  restore  mass  conservation: 

$$(u'_{i+1,j} - u'_{i,j})\Delta y + (v'_{i,j+1} - v'_{i,j})\Delta x = (u^*_{i,j} - u^*_{i+1,j})\Delta y + (v^*_{i,j} - v^*_{i,j+1})\Delta x = R^c_{i,j}, \qquad (4.23)$$

where $R^c_{i,j}$ is the mass residual in the (i,j)th control volume. The notation follows
the development in Chapter 2 except now that pressure and velocity are coupled it
is necessary to refer to the (i,j) notation on occasion. In Figure 2.3, $u_w$ is $u_{i,j}$, $u_e$ is
$u_{i+1,j}$, $v_s$ is $v_{i,j}$, and $v_n$ is $v_{i,j+1}$.

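The discrete u-momentum equation for the (i,j)th p control volume is written

$$(a_P^u)_{i,j}\, u'_{i,j} + p'_{i,j}\Delta y = \sum_{k=E,W,N,S} a_k^u u_k^* + (p^*_{i-1,j} - p^*_{i,j})\Delta y - (a_P^u)_{i,j}\, u^*_{i,j} = -R^u_{i,j}. \qquad (4.24)$$
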
The discretized momentum equations for the three other faces of the pressure control
volume are written analogously, giving a system of five equations in five unknowns,

$$\begin{bmatrix}
(a_P^u)_{i,j} & 0 & 0 & 0 & \Delta y \\
0 & (a_P^u)_{i+1,j} & 0 & 0 & -\Delta y \\
0 & 0 & (a_P^v)_{i,j} & 0 & \Delta x \\
0 & 0 & 0 & (a_P^v)_{i,j+1} & -\Delta x \\
-\Delta y & \Delta y & -\Delta x & \Delta x & 0
\end{bmatrix}
\begin{bmatrix} u'_{i,j} \\ u'_{i+1,j} \\ v'_{i,j} \\ v'_{i,j+1} \\ p'_{i,j} \end{bmatrix} =
\begin{bmatrix} -R^u_{i,j} \\ -R^u_{i+1,j} \\ -R^v_{i,j} \\ -R^v_{i,j+1} \\ R^c_{i,j} \end{bmatrix}. \qquad (4.25)$$


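The solution of this matrix equation is done by hand for $p'_{i,j}$,

$$p'_{i,j} = \frac{R^c_{i,j} + \Delta y \left[ \dfrac{R^u_{i+1,j}}{(a_P^u)_{i+1,j}} - \dfrac{R^u_{i,j}}{(a_P^u)_{i,j}} \right] + \Delta x \left[ \dfrac{R^v_{i,j+1}}{(a_P^v)_{i,j+1}} - \dfrac{R^v_{i,j}}{(a_P^v)_{i,j}} \right]}{\dfrac{\Delta y^2}{(a_P^u)_{i,j}} + \dfrac{\Delta y^2}{(a_P^u)_{i+1,j}} + \dfrac{\Delta x^2}{(a_P^v)_{i,j}} + \dfrac{\Delta x^2}{(a_P^v)_{i,j+1}}}. \qquad (4.26)$$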


The  velocity  corrections  are  found  by  back-substitution.     The  entire  procedure  is 
summarized  in  the  following  algorithm. 

BRB($u^*, v^*, p^*; \omega_{uv}, \omega_c$)

    Compute $u'$ coefficient $a_P^u(u^*, v^*)$ and residual $R^u_{i,j}$,  $\forall (i,j)$
    Compute $v'$ coefficient $a_P^v(u^*, v^*)$ and residual $R^v_{i,j}$,  $\forall (i,j)$
    Compute $p'_{i,j}$, back-substitute for $u'_{i,j}$, $u'_{i+1,j}$, $v'_{i,j}$, $v'_{i,j+1}$,  $\forall (i,j) \mid i + j$ = odd
    Correct all u, v, and odd p
        (analogous corrections for $u_{i+1,j}$, $v_{i,j}$, $v_{i,j+1}$, and $p_{i,j}$)
    Compute $u'$ coefficient $a_P^u(u, v)$ and residual $R^u_{i,j}$,  $\forall (i,j)$
    Compute $v'$ coefficient $a_P^v(u, v)$ and residual $R^v_{i,j}$,  $\forall (i,j)$
    Compute $p'_{i,j}$, back-substitute for $u'_{i,j}$, $u'_{i+1,j}$, $v'_{i,j}$, $v'_{i,j+1}$,  $\forall (i,j) \mid i + j$ = even
    Correct all u, v, and even p
        (analogous corrections for $u_{i+1,j}$, $v_{i,j}$, $v_{i,j+1}$, and $p_{i,j}$)
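
The local solve for one pressure control volume can be sketched as follows. The sign conventions are an assumption of this sketch: the four momentum rows read a_P u' +/- (Delta y or Delta x) p' = -R and the continuity row reads (u'_e - u'_w) Delta y + (v'_n - v'_s) Delta x = R^c, with w, e, s, n denoting the faces (i,j), (i+1,j), (i,j), (i,j+1). The function and argument names are hypothetical. The pressure correction is obtained by elimination and the velocity corrections by back-substitution; the assembled 5x5 matrix is included only so that the elimination can be checked.

```python
import numpy as np

def brb_local_solve(ap_w, ap_e, ap_s, ap_n, Ru_w, Ru_e, Rv_s, Rv_n, Rc, dx, dy):
    """Eliminate the four face velocities, solve for p', then back-substitute."""
    denom = dy**2 / ap_w + dy**2 / ap_e + dx**2 / ap_s + dx**2 / ap_n
    numer = Rc + dy * (Ru_e / ap_e - Ru_w / ap_w) + dx * (Rv_n / ap_n - Rv_s / ap_s)
    p = numer / denom
    u_w = (-Ru_w - dy * p) / ap_w
    u_e = (-Ru_e + dy * p) / ap_e
    v_s = (-Rv_s - dx * p) / ap_s
    v_n = (-Rv_n + dx * p) / ap_n
    return np.array([u_w, u_e, v_s, v_n, p])

def brb_local_matrix(ap_w, ap_e, ap_s, ap_n, dx, dy):
    """The same system in 5x5 matrix form (assumed sign conventions)."""
    return np.array([
        [ap_w, 0.0, 0.0, 0.0, dy],
        [0.0, ap_e, 0.0, 0.0, -dy],
        [0.0, 0.0, ap_s, 0.0, dx],
        [0.0, 0.0, 0.0, ap_n, -dx],
        [-dy, dy, -dx, dx, 0.0],
    ])

# Made-up coefficients and residuals for a single control volume.
coef = dict(ap_w=2.0, ap_e=2.5, ap_s=1.5, ap_n=3.0)
resid = dict(Ru_w=0.1, Ru_e=-0.2, Rv_s=0.05, Rv_n=0.3, Rc=0.4)
dx, dy = 0.1, 0.2

x = brb_local_solve(**coef, **resid, dx=dx, dy=dy)
M = brb_local_matrix(**coef, dx=dx, dy=dy)
rhs = np.array([-resid["Ru_w"], -resid["Ru_e"], -resid["Rv_s"], -resid["Rv_n"], resid["Rc"]])
print(np.max(np.abs(M @ x - rhs)))
```

In a red-black sweep this solve would be applied simultaneously to all control volumes with i + j odd and then to all with i + j even, since control volumes of one color do not share a pressure unknown.
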


In general, SIMPLE and BRB converge at different rates in the multigrid context.
Linden et al. [50] stated a preference for the locally-coupled explicit
smoother  rather  than  pressure-correction  methods.  The  argument  the  authors  gave 
was  that  the  local  coupling  of  variables  is  better  suited  to  produce  local  smoothing 
of  residuals,  i.e.  faster  resolution  of  the  local  variations  in  the  solution.  This  is  be- 
lieved to  allow  a  more  accurate  coarse-grid  approximation  of  the  fine-grid  problem. 
Similar  reasoning  appears  to  have  been  applied  in  the  original  development  [89], 
by  Ferziger  and  Peric  [22],  and  by  Ghia  et  al.  [28].  Linden  et  al.  [50]  did  a  sim- 
plified Fourier  analysis  of  locally-coupled  smoothing  for  the  Stokes  equations  and 
confirmed  good  smoothing  properties  of  the  locally-coupled  explicit  method.  Shaw 
and  Sivaloganathan  [71]  have  found  that  SIMPLE  (with  the  SLUR  solver)  also  has 
good  smoothing  properties  for  the  Stokes  equations,  assuming  that  the  pressure- 
correction  equation  is  solved  completely  during  each  iteration.  Thus  there  is  some 
analytical  evidence  that  both  pressure-correction  methods  and  the  locally-coupled 
explicit  technique  are  suitable  as  multigrid  smoothers.  However,  the  analytical  work 
is  oversimplified — numerical  comparisons  are  needed  on  a  problem-by-problem  basis. 

Sockol  [80]  has  compared  the  performance  of  BGS,  two  line-updating  variations 
on  BGS,  and  the  SIMPLE  method  with  successive  line-underrelaxation  for  the  inner 
iterations.  Three  model  flow  problems  were  tested  with  different  physical  charac- 
teristics and  varying  grid  aspect  ratios:  lid-driven  cavity  flow,  channel  flow,  and  a 
combined  channel/cavity  flow  ("open  cavity").  In  terms  of  work  units,  Sockol  found 
that  all  four  smoothers  were  competitive  for  lid-driven  cavity  flow  over  a  range  of 
Re  from  100  to  5000.  For  the  developing  channel  flow,  BGS  and  its  line-updating 
variants  converged  faster  than  SIMPLE  on  square  grids,  but  as  the  grid  aspect  ratio 
increased  SIMPLE  became  competitive. 


Brandt  and  Yavneh  [8]  have  developed  a  line-relaxation-based  multigrid  method 
which  handles  pressure  and  velocity  sequentially.  Good  convergence  rates  were  ob- 
served for  "entering-type"  flow  problems  in  which  the  flow  has  a  dominant  direction 
and  is  aligned  with  grid  lines.  Line-relaxation  has  the  effect  of  providing  non-isotropic 
error  smoothing  properties  to  match  the  physics  of  the  problem.  Wesseling  [91]  an- 
alyzed several  line-relaxation  methods,  and  concluded  that  alternating  line-Jacobi 
relaxation  had  robust  smoothing  properties  and,  somewhat  unexpectedly,  that  it  was 
a  better  choice  than  SLUR. 

For  pressure-based  smoothers,  numerical  experimentation  apparently  has  created 
some  intuition  regarding  the  relative  performance  of  sequential  and  locally-coupled 
smoothers  in  model  flow  problems,  but  many  of  the  issues  have  not  been  investigated 
systematically.  Further  research  perhaps  should  not  be  directed  toward  the  goal  of 
picking  one  method  over  the  other.  General  conclusions  are  unlikely  because  the 
convergence  rate  is  dependent  on  the  particular  flow  problem.  Instead,  both  types  of 
smoothers  should  continue  to  be  implemented  and  tested  in  the  multigrid  context, 
not  to  determine  a  preference  but  rather  to  build  understanding  for  their  application 
to  complex  flow  problems. 

The costs per iteration of BRB and SIMPLE are comparable on serial computers.
If $\nu_u = \nu_v = 1$ and $\nu_c = 4$ successive line-underrelaxation inner iterations are used,
SIMPLE costs about 30% more per iteration than BGS [80]. BGS and BRB are
identical in terms of run time on a serial computer.

The  relative  cost  is  different  on  parallel  computers  though.  Figures  4.2,  4.3  and 
4.4 compare the parallel run time per iteration of BRB with SIMPLE on a 128-VU
CM-5 (i.e., 32 SPARC nodes each controlling 4 vector units), for a fixed number of
iterations  (500)  of  the  single-grid  BRB  and  SIMPLE  solvers.  The  convection  terms 
are  central-differenced  and,  for  SIMPLE,  point-Jacobi  inner  iterations  are  used  with 
$\nu_u = \nu_v = 3$ and $\nu_c = 9$. The problem size is given in terms of the virtual processor
ratio;  the  largest  problem  size  in  Figures  4.2,  4.3  and  4.4  is  10^  grid  points. 

Figure  4.2  indicates  that  SIMPLE  and  BRB  have  virtually  the  same  cost  per  500 
iterations  and  that  this  cost  scales  linearly  with  the  problem  size  on  a  fixed  number 
of  processors.  Figure  4.3  shows  that  BRB  requires  almost  twice  as  much  time  on 
coefficient  computations,  but  only  about  half  as  much  on  solving  for  the  pressure 
changes  and  back-substituting.  The  coefficient  computation  cost  would  be  exactly 
twice  that  of  SIMPLE  except  for  the  small  contribution  from  the  computation  of  the 
p'-equation  coefficients  in  the  SIMPLE  procedure. 

Figure  4.4  shows  the  amount  of  time  spent  on  computation  and  interprocessor 
communication.  The  interprocessor  communication  cost  is  relatively  small  compared 
to  the  computation  cost.  Also,  the  sum  of  the  two  is  less  than  the  total  elapsed 
time shown in Figure 4.2, due to front-end-to-processor communication. The ratio of
the time spent in computation to the overall time is essentially the efficiency. Thus, the results
shown  in  Figures  4.2-4.4  are  summarized  by  the  point-Jacobi  curve  in  Figure  3.2. 
Furthermore,  the  breakdown  into  communication  and  computation  is  approximately 
the  same  for  both  SIMPLE  and  BRB,  so  in  terms  of  efficiency,  similar  characteristics 
for  BRB  are  expected  as  were  observed  in  Chapter  3  for  SIMPLE. 

In  Figures  4.2-4.4  the  SIMPLE  timings  will  be  different  if  line-Jacobi  inner  it- 
erations are  used  instead  of  point-Jacobi  inner  iterations.  The  parallel  efficiency  is 
reduced  and  the  actual  parallel  run  time  is  greater.  One  line-Jacobi  inner  iteration 
(consisting  of  two  tridiagonal  solves — one  treating  the  unknowns  implicitly  along  hor- 
izontal lines  and  the  other  for  the  vertical  lines)  using  the  cyclic  reduction  method 
introduced  in  Chapter  3  takes  about  8-10  times  as  long  as  one  point-Jacobi  iteration 
on the CM-5. Line-Jacobi inner iterations are therefore not preferred over point-Jacobi
inner iterations for use in the SIMPLE algorithm unless the benefit to the
convergence  rate  is  substantial. 

The line-updating variants of BRB (see [80, 87]) are even worse in comparison with
BRB than the line-Jacobi SIMPLE method is in comparison with the point-Jacobi
SIMPLE method; they are not suitable for SIMD computation. The line-updating
variations  on  BGS  couple  pressures  and  velocities  between  control  volumes  along  a 
line  as  well  as  within  each  control  volume.  By  contrast,  in  sequential  pressure-based 
methods,  line-iterative  methods  are  used  within  the  context  of  solving  the  individual 
systems  of  equations,  so  only  a  single  variable  is  involved. 

On the staggered grid, the unknowns which are to be updated simultaneously in
the line-variant of BRB are, for a constant j line, $\{p_{2,j}, u_{3,j}, p_{3,j}, \ldots, u_{ni-1,j}, p_{ni-1,j}\}$.
Setting up the tridiagonal system of equations for solving for these unknowns
simultaneously requires coefficient and source-term data to be moved from arrays which
have the same layout as the u and p arrays into one or more arrays
which have a longer dimension in the i-direction. Instead of having dimension
ni, the array which contains the unknowns, diagonals, and right-hand sides has
dimension 2ni. The elements 1:ni for the constant j line of u and the u coefficient
arrays, $(u, a_P, a_E, a_W, a_N, a_S, b)$, must be moved into positions 1:2ni:2. Similar data
movement is required for the p coefficients and data. Thus, "SEND"-type communication
will be generated during each iteration to set up the tridiagonal system of
equations along the lines. This type of communication is prohibitively expensive in
an algorithm where all the other operations are relatively fast and efficient.
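
The data movement described above amounts to interleaving two arrays of length ni into one of length 2ni with a stride of two. A minimal NumPy sketch is given below; the function name and the choice of placing the p entries in the remaining (even-numbered, in 1-based terms) positions are assumptions made for illustration.

```python
import numpy as np

def pack_line(u_line, p_line):
    """Interleave one grid line of u-type and p-type data into a 2*ni array."""
    ni = u_line.shape[0]
    packed = np.empty(2 * ni, dtype=np.result_type(u_line, p_line))
    packed[0::2] = u_line      # positions 1:2ni:2 in 1-based, Fortran-style indexing
    packed[1::2] = p_line      # remaining positions 2:2ni:2 (assumed)
    return packed

u_line = np.array([1.0, 2.0, 3.0, 4.0])
p_line = np.array([10.0, 20.0, 30.0, 40.0])
packed = pack_line(u_line, p_line)
print(packed)
```

On a serial machine this is a cheap strided copy. On a data-parallel machine the source and destination arrays have different shapes and therefore different processor layouts, so the same assignment implies general interprocessor communication for every element moved.
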

Thus,  if  line-relaxation  smoothing  is  required  to  solve  a  particular  flow  problem  for 
either  a  single-grid  or  a  multigrid  computation  on  the  CM-5,  the  pressure-correction 
methods  should  be  used.    Otherwise,  either  BRB  or  SIMPLE-type  methods  can  be 
used, if time per iteration is the only consideration. With $\nu_u = \nu_v = 3$, $\nu_c = 9$, and
point-Jacobi  inner  iterations,  SIMPLE  and  BRB  have  essentially  the  same  parallel 
cost  and  efficiency. 

4.3     Stability  of  Multigrid  Iterations 

It  is  well  known  that  central-difference  discretizations  of  the  convection  terms  in 
the  Navier-Stokes  equations  may  be  unstable  if  cell  Peclet  numbers  are  greater  than 
two,  depending  on  the  boundary  conditions  [73].  The  coarse-grid  level(s)  have  higher 
cell  Peclet  numbers.  Consequently,  multigrid  iterations  may  diverge,  driven  by  the 
divergence  of  smoothing  iterations  on  coarse  grids,  if  central-differencing  is  used.  The 
convection  terms  on  coarse  grids  may  need  to  be  upwinded  for  stability.  However, 
second-order  accuracy  is  usually  desired  on  the  finest  grid.  The  "stabilization  strat- 
egy" is  the  approach  used  to  provide  stability  of  the  coarse-grid  discretizations  while 
simultaneously  providing  second-order  accuracy  for  the  fine-grid  solution. 
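
The growth of the cell Peclet number with coarsening is easy to quantify. The sketch below assumes the definition |u| h / nu with a single representative speed, standard coarsening (the mesh spacing doubles on each coarser level), and illustrative values for a unit cavity at Re = 1000 on an 81 x 81 grid; the function names are hypothetical.

```python
def cell_peclet(speed, h, nu):
    """Cell Peclet number based on a representative speed and mesh spacing."""
    return abs(speed) * h / nu

def peclet_per_level(speed, h_fine, nu, n_levels):
    """Cell Peclet number on each level; level 0 is the finest grid."""
    return [cell_peclet(speed, h_fine * 2**k, nu) for k in range(n_levels)]

# Unit lid speed and cavity size, nu = 1/Re with Re = 1000, h = 1/80 on the finest grid.
levels = peclet_per_level(speed=1.0, h_fine=1.0 / 80, nu=1.0e-3, n_levels=4)
for k, pe in enumerate(levels):
    print(f"level {k}: h = {2**k / 80:.4f}, cell Peclet = {pe:.1f}")
```

Because the lid speed bounds the local velocity, these values are upper estimates, but they show that each coarsening doubles the cell Peclet number, so a central-difference discretization that is acceptable on the finest grid can exceed the stability limit on coarser ones.
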

The  naive  stabilization  strategy  is  to  simply  discretize  the  convection  terms  with 
first-order  upwinding  on  the  coarse-grid  levels  and  by  second-order  central-differencing 
on  the  finest  grid.  Unfortunately,  the  naive  approach  does  not  work — there  is  a  "mis- 
match" between  the  solutions  on  neighboring  levels  if  different  convection  schemes 
are  employed,  resulting  in  poor  coarse-grid  corrections.  In  practice  divergence  usually 
results.  The  coarse-grid  discretization  needs  to  be  consistent  with  the  fine-grid  dis- 
cretization in  order  that  an  accurate  approximation  of  the  fine-grid  residual  equation 
is  generally  possible. 

In  the  present  work  a  "defect-correction"  stabilization  strategy  is  employed  as 
in  [80,  81,  87,  89].  The  convection  terms  on  all  coarse  grids  are  discretized  by  first- 
order  upwinding.  The  convection  terms  on  the  finest  grid  are  also  upwinded,  but  a 
source-term correction is applied which allows second-order central-difference accuracy
to be obtained when the multigrid iterations have converged.

Another  approach  is  to  use  a  stable  second-order  accurate  convection  scheme, 
e.g. second-order upwinding, on all grid levels [74]. Shyy and Sun [74] applied each
of several convection schemes on all grid levels in turn and compared the convergence rates.
Central-differencing,  first-order  upwinding,  and  second-order  upwinding  were  tested 
for  Re  =  100  and  Re  =  1000  lid-driven  cavity  flow  problems.  Comparable  conver- 
gence rates  were  obtained  for  all  three  convection  schemes,  whereas  for  single-grid 
computations  there  are  relatively  large  differences  in  the  convergence  rates.  Central- 
differencing  was  unstable  for  the  Re  =  1000  case,  but  a  hybrid  strategy  with  second- 
order  upwinding  on  the  coarsest  three  grid  levels  and  central-differencing  on  the  finer 
grid levels remedied the problem without deteriorating the convergence rate. Further
study of this issue is conducted in Chapter 5, in which the convergence rate and
stability characteristics of second-order upwinding on all grid levels are contrasted with
the defect-correction strategy.

A  third  possibility  is  simply  to  add  extra  numerical  viscosity  to  the  physical 
viscosity  on  coarse  grids.  This  technique  has  been  investigated  by  Fourier  analysis 
for  a  model  linear  convection-diffusion  equation  in  [93].  The  authors'  best  strategy 
was  the  one  in  which  the  amount  of  numerical  viscosity  was  taken  to  be  proportional 
to the grid spacing on the next (finer) multigrid level. For the Navier-Stokes equations this
brute-force  approach  is  not  expected  to  perform  very  well  because  the  solutions  on 
the  fine  grids  are  frequently  not  just  a  smooth  continuation  of  the  lower  Reynolds 
number  flow  problems  being  solved  on  the  coarse  grid  levels.  Rather,  fundamental 
changes  in  the  fluid  dynamics  occur  as  Reynolds  number  increases. 

4.3.1     Defect-Correction Method

In the defect-correction approach, the discretized equations for a variable $\phi$ are
derived as follows. In general, the equations have the form

$$a_P^{c2}\phi_P = a_E^{c2}\phi_E + a_W^{c2}\phi_W + a_N^{c2}\phi_N + a_S^{c2}\phi_S + b^{c2}, \qquad (4.27)$$

where the superscript "c2" denotes central-differencing of the convection terms.
To form the discrete defect-correction equation, the corresponding first-order
upwinded discrete equation is added to and subtracted from Eq. 4.27 and rearranged
to give

$$\begin{aligned}
a_P^{u1}\phi_P = {} & a_E^{u1}\phi_E + a_W^{u1}\phi_W + a_N^{u1}\phi_N + a_S^{u1}\phi_S + b^{u1} \\
& + \left[ (a_P^{u1} - a_P^{c2})\phi_P - (a_E^{u1} - a_E^{c2})\phi_E - (a_W^{u1} - a_W^{c2})\phi_W \right. \\
& \left. \quad - (a_N^{u1} - a_N^{c2})\phi_N - (a_S^{u1} - a_S^{c2})\phi_S - (b^{u1} - b^{c2}) \right],
\end{aligned} \qquad (4.28)$$

where the superscript "u1" denotes the first-order upwinding of the convection terms.
The term in brackets is equal to the difference in residuals, so Eq. 4.28 can be written

$$a_P^{u1}\phi_P = a_E^{u1}\phi_E + a_W^{u1}\phi_W + a_N^{u1}\phi_N + a_S^{u1}\phi_S + b^{u1} + [r^{u1} - r^{c2}]. \qquad (4.29)$$

To obtain the updated solution, the difference in residuals is lagged. Thus Eq. 4.29
for the solution at iteration counter "n+1" with the residuals evaluated at iteration
counter "n" is written

$$a_P^{u1}\phi_P^{n+1} = a_E^{u1}\phi_E^{n+1} + a_W^{u1}\phi_W^{n+1} + a_N^{u1}\phi_N^{n+1} + a_S^{u1}\phi_S^{n+1} + b^{u1} + [r^{u1} - r^{c2}]^n. \qquad (4.30)$$

Moving the first five terms on the right-hand side to the left-hand side, Eq. 4.30 can
be rewritten concisely as

$$[r^{u1}]^{n+1} = [r^{u1}]^n - [r^{c2}]^n, \qquad (4.31)$$

in which it is easily seen that satisfaction of the second-order central-difference
discretized equations, $r^{c2} \rightarrow 0$, is recovered when $[r^{u1}]^{n+1}$ is approximately equal
to $[r^{u1}]^n$.
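
The lagged iteration of Eqs. 4.30 and 4.31 can be tried on a scalar model problem. The sketch below is an illustration only, under these assumptions: steady 1-D convection-diffusion with constant velocity and diffusivity, Dirichlet ends, a cell Peclet number of one (so that central differencing is itself stable), residuals defined as r = A phi - b, and a direct solve of the upwinded system in each iteration. All names and parameter values are hypothetical.

```python
import numpy as np

def assemble(n, h, u, gamma, scheme, phi_left=0.0, phi_right=1.0):
    """Return (A, b) for a_P*phi_P = a_E*phi_E + a_W*phi_W + b on n interior nodes."""
    d = gamma / h                                   # diffusion conductance
    if scheme == "c2":                              # second-order central differencing
        a_e, a_w = d - 0.5 * u, d + 0.5 * u
    elif scheme == "u1":                            # first-order upwinding
        a_e, a_w = d + max(-u, 0.0), d + max(u, 0.0)
    else:
        raise ValueError(scheme)
    a_p = a_e + a_w
    A = a_p * np.eye(n) - a_e * np.eye(n, k=1) - a_w * np.eye(n, k=-1)
    b = np.zeros(n)
    b[0] += a_w * phi_left                          # boundary values enter the source
    b[-1] += a_e * phi_right
    return A, b

def defect_correction(n=19, u=1.0, gamma=0.05, iters=60):
    """Iterate the upwinded system with a lagged residual-difference source."""
    h = 1.0 / (n + 1)
    A_u1, b_u1 = assemble(n, h, u, gamma, "u1")
    A_c2, b_c2 = assemble(n, h, u, gamma, "c2")
    phi = np.zeros(n)
    for _ in range(iters):
        r_u1 = A_u1 @ phi - b_u1                    # residuals at iteration n
        r_c2 = A_c2 @ phi - b_c2
        phi = np.linalg.solve(A_u1, b_u1 + (r_u1 - r_c2))   # Eq. 4.30
    return phi, np.linalg.solve(A_c2, b_c2)

phi_dc, phi_c2 = defect_correction()
print(np.max(np.abs(phi_dc - phi_c2)))
```

Only the upwinded matrix is ever inverted, yet the converged values agree with the central-difference solution, which is the property the defect-correction strategy relies on.
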

Table  4.1  compares  the  convergence  rates  for  single-grid  SIMPLE  computations 
using  four  popular  convection  schemes,  for  a  lid-driven  cavity  flow  problem.  The 
purpose  is  to  gain  some  intuition  regarding  the  convergence  properties  of  the  defect- 
correction  scheme.  For  all  the  cases  presented  in  the  table,  the  grid  size  was  81  x  81. 
The table gives the number of iterations required to converge both of the momentum
equations to the level $\|r_u\| < 10^{-5}$, where the $L_1$ norm is used, divided by the number
of grid points.

The  inner  iterative  procedure  for  computing  an  approximate  solution  to  the  u,  v, 
and  p'  systems  of  equations,  during  the  course  of  the  outer  iterations  of  the  SIMPLE 
algorithm,  is  listed  in  column  2.  In  the  line-Jacobi  method,  all  the  horizontal  lines 
are  solved  simultaneously,  followed  by  the  vertical  lines,  during  a  single  inner  itera- 
tion. The  SLUR  procedure  (same  technique  as  in  Chapter  2)  also  alternates  between 
horizontal  and  vertical  lines.  In  addition,  the  grid  lines  are  swept  one  at  a  time  in  the 
direction  of  increasing  i  or  j,  in  the  Gauss-Seidel  fashion,  instead  of  all  at  once  as  in 
the  line-Jacobi  method.  The  number  of  inner  iterations  for  each  governing  equation 
was $\nu_u = \nu_v = 3$ and $\nu_c = 9$ in the Re = 1000 problem. These parameters are
increased to 5, 5, and 10 for the Re = 3200 flow. The inner iteration damping factor
for  the  line-Jacobi  iterative  method  was  0.7. 

For  the  Re  =  1000  cases,  the  SIMPLE  relaxation  factors  are  0.4  for  the  momen- 
tum equations  and  0.7  for  the  pressure.  The  convergence  rate  of  defect-correction 
iterations  is  not  quite  as  good  as  central-differencing  or  first-order  upwinding,  but  it  is 
slightly  better  than  second-order  upwinding.  This  result  is  anticipated  for  cases  where 
                                                       Convection Scheme
Flow                Inner Iterative   First-order   Defect       Central        Second-order
Problem             Method            Upwinding     Correction   Differencing   Upwinding

Re = 1000 Cavity    Point-Jacobi       2745          3947         1769           4419
Re = 1000 Cavity    Line-Jacobi        2442          3497         1543           3610
Re = 1000 Cavity    SLUR               2433          3482         1534           3568
Re = 3200 Cavity    Point-Jacobi      16526         > 20000      12302          > 20000
Re = 3200 Cavity    Line-Jacobi       16462         > 20000      12032          > 20000
Re = 3200 Cavity    SLUR              16458         > 20000      11985          > 20000

Table 4.1. Number of single-grid SIMPLE iterations to converge to $\|r_u\| < 10^{-5}$, for
the lid-driven cavity flow on an 81 x 81 grid. The $L_1$ norm is used, normalized by
the number of grid points.


central-differencing does not have stability problems, since the defect-correction discretization
is a less-implicit version of central-differencing. Likewise one should expect
the  convergence  rate  of  SIMPLE  with  the  defect-correction  convection  scheme  to  be 
slightly  slower  than  with  the  first-order  upwind  scheme  due  to  the  presence  of  source 
terms  which  vary  with  the  iterations.  The  method  (line-Jacobi,  point-Jacobi,  SLUR) 
used  for  inner  iterations  has  no  influence  on  the  convergence  rate  for  either  Reynolds 
number  tested.  From  experience  it  appears  that  the  lid-driven  cavity  flow  is  unusual 
in  this  regard.  For  most  problems  the  inner  iterative  procedure  makes  a  significant 
difference  in  the  convergence  rate. 

For  the  Re  =  3200  cases,  the  relaxation  factors  were  reduced  until  a  converged 
solution  was  possible  using  central-differencing.  Then  these  relaxation  factors,  0.1 
for  the  momentum  equations  and  0.3  for  pressure,  were  used  in  conjunction  with 
the  other  convection  schemes.  Actually,  in  the  lid-driven  cavity  flows,  the  pressure 
plays  a  minor  role  in  comparison  with  the  balance  between  convection  and  diffusion. 
Consequently,  the  pressure  relaxation  factor  can  be  varied  between  0.1  and  0.5  with 
negligible  impact  on  the  convergence  rate.  The  convergence  rate  is  very  sensitive  to 
the momentum relaxation factor, however. The Re = 3200 cavity flow is hard to
converge, and neither the defect-correction nor the second-order upwind scheme
converges within 20000 iterations for these relaxation factors. Second-order central
differencing does not normally
look  this  good  either.  The  lid-driven  cavity  flow  is  a  special  case  for  which  central- 
difference  solutions  can  be  obtained  for  relatively  high  Reynolds  numbers  due  to 
the  shear-driven  nature  of  the  flow  and  the  relative  unimportance  of  the  pressure- 
gradient.  For  the  Re  =  3200  case,  the  convergence  paths  of  the  four  convection 
schemes  tested  are  shown  in  Figure  4.5.  None  of  the  convection  schemes  is  diverging, 
but  the  amount  of  smoothing  appears  to  be  insufficient  to  handle  the  source  terms 
in  the  2nd-order  upwind  and  defect-correction  schemes  for  this  Reynolds  number. 

4.3.2     Cost  of  Different  Convection  Schemes 

There  was  initially  some  concern  that  the  source  term  evaluations  in  the  defect- 
correction  and/or  second-order  upwind  convection  schemes  might  be  expensive  in 
terms  of  the  parallel  run  time.  In  light  of  Figure  4.3,  it  is  of  interest  to  know  whether 
the  cost  per  iteration  is  significantly  increased,  as  this  consequence  might  lead  one 
to  favor  one  convection  scheme  over  another  for  considerations  of  run  time,  if  both 
have satisfactory convergence rate characteristics. Figure 4.6 compares the cost of
computing  the  coefficients  of  the  discrete  u,  v,  and  p'  equations,  for  three  convection 
schemes.  The  timings  were  obtained  on  a  32-node  (128  vector  unit)  CM-5  for  500 
SIMPLE  iterations. 

Since  the  smoother  and  the  coefficient  computations  are  the  most  time-consuming 
tasks  in  the  SIMPLE  algorithm,  the  cost  of  the  inner  iterations  (the  "solver")  is 
included for comparison purposes (the solid line). There are 15 point-Jacobi inner
iterations per outer iteration, distributed 3 each on the momentum equations and 9
on the p'-system of equations.

The timings were obtained over a range of problem sizes, for 500 SIMPLE iterations.
The x-axis in Figure 4.6 plots problem size in terms of the virtual processor
ratio  VP.  VP  is  preferred  over  the  number  of  grid  points  so  that  the  results  can 
be  carried  over  to  CM-5s  with  more  processors.  The  coefficient  cost  scales  linearly 
with  problem  size  and,  with  the  defect-correction  scheme,  requires  about  the  same 
time  as  solving  the  equations.  If  more  inner  iterations  were  used,  or  the  more  costly 
line-Jacobi  method  was  used,  the  fraction  of  the  overall  run  time  due  to  the  computa- 
tion of  coefficients  would  decrease.  The  linear  scaling  with  VP  is  possible  due  to  the 
uniform  boundary  coefficient  computation  implementation,  discussed  in  Chapter  3. 

The  figure  also  shows  that  second-order  upwinding  of  the  convection  terms  costs 
more  than  the  other  schemes,  by  approximately  50%.  Additional  testing  has  shown 
that  the  first-order  upwind,  hybrid,  central-difference,  and  defect-correction  schemes 
all  use  roughly  the  same  amount  of  time. 

More  details  are  shown  in  Figure  4.7,  which  breaks  down  the  time  spent  com- 
puting coefficients  into  computation  and  interprocessor  communication.  Because  the 
difference  stencils  are  compact,  only  nearest-neighbor  processing  elements  need  to 
communicate  in  the  calculation  of  the  equation  coefficients.  These  are  "NEWS"- 
type  communications  on  the  CM-5.  In  the  present  implementation,  the  coefficient 
computations  for  the  momentum  equations  require  9  NEWS  communications  for  the 
defect-correction,  central-differencing,  and  first-order  upwind  schemes.  Second-order 
upwinding  requires  at  least  13  NEWS  communications.  In  the  present  implemen- 
tation 17  communication  operations  are  needed  because  the  formulation  supports 
nonuniform  grids  and  therefore  some  geometric  quantities  need  to  be  communicated 
in addition to the nearby velocities. The additional NEWS communication is apparent
in Figure 4.7. Similarly, the second-order upwind scheme involves more computation
than the other schemes.

Coincidentally, the additional computation and interprocessor communication of
the second-order upwind convection scheme offset each other in terms of their effect
on the parallel efficiency. With either convection scheme the trend is essentially the
same (Figure 4.8). Figure 3.2 gave the variation of E with VP for central-differencing.

4.4     Restriction  and  Prolongation  Procedures 

The  discretization  of  the  convection  terms  on  coarse  grids  is  a  key  issue  because  the 
coarse  grid  problem  must  be  a  reasonable  approximation  to  the  fine-grid  discretized 
equation,  in  order  to  obtain  good  corrections.  In  addition,  for  the  formulation  given 
in  the  background  section,  one  must  also  say  how  the  coarse-grid  source  terms  are 
computed,  and  how  the  corrections  are  interpolated  to  the  fine  grid.  The  restric- 
tion and  prolongation  procedures  affect  both  the  stability  and  convergence  rate.  In 
this  section,  three  restriction  procedures  and  two  prolongation  procedures  have  been 
compared  on  two  model  problems  with  different  physical  characteristics  to  assess  the 
effect  of  the  intergrid  transfer  procedures  on  the  multigrid  convergence  rate. 

For  finite-volume  discretizations,  conservation  is  the  natural  restriction  procedure 
for  the  equation  residuals,  because  the  terms  in  the  discrete  equations  represent 
integrals  over  an  area.  The  method  of  integration  for  source  terms  determines  the 
actual  restriction  procedure.  For  piecewise  constant  treatment  of  source  terms  in  a 
cell-centered  finite-volume  discretization,  the  mass  residual  in  a  coarse-grid  control 
volume  is  the  sum  of  the  mass  residuals  in  the  four  fine-grid  control  volumes  which 
comprise  the  coarse-grid  control  volume.  This  restriction  procedure  is  used  for  the 
residuals  of  the  continuity  equation  in  every  case  tested. 
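
The summation restriction for the mass residual can be illustrated with a short sketch. This is a minimal Python illustration rather than the CM Fortran implementation; the function name `restrict_sum` and the list-of-lists layout are assumptions made here, and the fine grid is assumed to have even extents so that each coarse-grid control volume contains exactly four fine-grid control volumes.

```python
def restrict_sum(fine):
    """Restrict a cell-centered residual field by summing each 2 x 2 block
    of fine-grid control volumes into one coarse-grid control volume."""
    ni, nj = len(fine), len(fine[0])
    if ni % 2 or nj % 2:
        raise ValueError("fine-grid extents must be even")
    coarse = [[0.0] * (nj // 2) for _ in range(ni // 2)]
    for i in range(ni):
        for j in range(nj):
            # Fine cell (i, j) lies inside coarse cell (i // 2, j // 2).
            coarse[i // 2][j // 2] += fine[i][j]
    return coarse


fine_residual = [[1, 2, 3, 4],
                 [5, 6, 7, 8],
                 [9, 10, 11, 12],
                 [13, 14, 15, 16]]
coarse_residual = restrict_sum(fine_residual)
```

Because the residuals are summed rather than averaged, the total mass residual over the domain is the same on both grids, which is the conservation property the finite-volume argument relies on.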

If the mass residual is summed, and u and v are restricted by cell-face averaging
(described below), the right-hand side of Eq. 4.22 is identically zero [80], which implies
that the velocity field on coarse grids also satisfies the continuity equation, in addition
to the velocity field on the finest grid. However, it is not necessary to have identically
zero coarse-grid source terms, even in the continuity equation.

Restriction  procedure  "3"  obtains  the  initial  coarse-grid  solutions  not  by  restrict- 
ing the  solutions,  but  instead  by  taking  the  most  recently  computed  values  on  the 
coarse  grid.  These  values  will  be  from  the  previous  multigrid  cycle.  The  u-momentum 
equation  residuals  are  summed  over  the  six  fine-grid  u  control  volumes  which  comprise 
the  coarse-grid  u  control  volume  under  consideration.  Only  half  the  contribution  is 
taken  from  the  cell-face  neighbor  u  control  volumes  due  to  the  staggered  grid. 

For the restriction procedure denoted "1," u, v, and the momentum equation
residuals  are  restricted  by  cell-face  averaging.  Cell-face  averaging  refers  to  the  aver- 
aging of  the  two  fine-grid  u  velocity  components  immediately  above  and  below  the 
coarse-grid  u  location,  which  are  on  the  same  coarse-grid  p  control  volume  face.  Sim- 
ilar treatment  is  applied  to  v.  The  coarse-grid  pressures  are  obtained  by  averaging 
the  four  nearest  fine-grid  pressures. 

The  restriction  procedure  "2"  indicates  a  weighted  average  of  six  fine-grid  u  ve- 
locity components,  the  cell-face  ones  and  their  nearest-neighbors  on  either  side.  The 
cell-face  fine-grid  u  velocity  components  contribute  twice  as  much  as  their  neighbors. 
Similar  treatment  is  applied  for  v,  and  for  the  momentum  equation  residuals.  The 
coarse-grid  pressures  are  obtained  by  averaging  the  four  nearest  fine-grid  pressures, 
as  in  restriction  procedure  1. 
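
The two averaging procedures can be sketched for a single coarse-grid u location as follows. This is only an illustration under an assumed indexing convention, not the thesis code: `u_fine[i][j]` is taken to be the u component on vertical face i in cell row j, coarse face I is assumed to coincide with fine face 2I, and coarse row J is assumed to cover fine rows 2J and 2J + 1. The six-point weights are read here as the two cell-face values plus their neighbors on either side in the x-direction. All function names are hypothetical.

```python
def restrict_u_cell_face(u_fine, I, J):
    """Procedure 1: average the two fine-grid u values on coarse face I."""
    return 0.5 * (u_fine[2 * I][2 * J] + u_fine[2 * I][2 * J + 1])


def restrict_u_six_point(u_fine, I, J):
    """Procedure 2: weighted average of six fine-grid u values. The two
    cell-face values carry twice the weight of their x-direction neighbors."""
    if I == 0 or 2 * I + 1 >= len(u_fine):
        raise ValueError("six-point average needs neighbors on both sides")
    total = 0.0
    for j in (2 * J, 2 * J + 1):
        total += u_fine[2 * I - 1][j] + 2.0 * u_fine[2 * I][j] + u_fine[2 * I + 1][j]
    return total / 8.0  # weights 1 + 2 + 1 on each of two rows


def restrict_p(p_fine, I, J):
    """Both procedures: average the four nearest fine-grid pressures."""
    return 0.25 * (p_fine[2 * I][2 * J] + p_fine[2 * I + 1][2 * J]
                   + p_fine[2 * I][2 * J + 1] + p_fine[2 * I + 1][2 * J + 1])


# A field that varies linearly in both directions is reproduced by both averages.
u_fine = [[10.0 * i + j for j in range(4)] for i in range(5)]
p_fine = [[float(i + j) for j in range(4)] for i in range(4)]
```

For a linearly varying field the two velocity restrictions give the same value; they differ only when the fine-grid field has curvature in the x-direction, in which case the six-point average smooths more strongly.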

For  the  prolongation  procedures,  "1"  and  "2"  indicate  bilinear  and  biquadratic 
interpolation,  respectively.  The  bilinear  interpolation  procedure  is  identical  to  that 
used  by  Shyy  and  Sun  [74],  in  which  the  two  nearest  coarse-grid  corrections  along  a 
line  X  =  constant  (for  u)  are  used  to  compute  the  correction  at  the  location  of  the 
fine-grid  u  velocity  component,  by  linear  interpolation.  Similar  treatment  is  adopted 
for v corrections. To compute the corrections on the "in-between" fine-grid lines the
available fine-grid corrections are interpolated linearly. Corrections for pressure are
interpolated  linearly  from  the  four  nearest  coarse-grid  values. 

The  biquadratic  interpolation  procedure,  "2,"  is  similar  to  the  procedure  used 
by  Bruneau  and  Jouron  [12].  It  finishes  in  exactly  the  same  way  as  the  bilinear  in- 
terpolation, but  is  preceded  by  a  quadratic  (instead  of  linear)  interpolation  in  the 
y-direction,  and  an  averaging  in  the  x-direction.  Thus,  the  three  nearest  correction 
quantities  on  the  coarse  grid  (above  and  below  the  fine-grid  u  location)  are  used  to 
interpolate  in  the  y-direction  for  a  correction  located  at  the  position  of  the  fine-grid 
u  velocity  component.  After  this  y-direction  interpolation  there  are  two  corrections 
defined  on  each  face  of  the  coarse-grid  u  control  volumes,  at  the  locations  correspond- 
ing to  the  locations  of  the  fine-grid  u  velocity  components.  These  are  injected  to  give 
the fine-grid corrections at these points after a weighted averaging in the x-direction.
For example, on a uniform grid this pre-injection averaging goes like:

$$u_{c,corr}(I, J) = \frac{u_{c,corr}(I+1, J) + 2\,u_{c,corr}(I, J) + u_{c,corr}(I-1, J)}{4}, \tag{4.32}$$

where $u_{c,corr}$ and the capitalized indices indicate that the correction quantities are
still  defined  on  the  coarse  grid — they  are  positioned  to  correspond  with  the  fine-grid 
u  locations.  After  the  averaged  corrections  are  injected  to  the  fine  grid,  the  fine- 
grid  corrections  are  defined  along  every  other  line  x  =  constant.  The  corrections 
on  "in-between"  lines  are  linearly  interpolated  from  the  injected,  averaged  correc- 
tions. Similar  treatment  is  adopted  for  the  v  corrections.  Corrections  for  pressure 
are  interpolated  biquadratically  from  the  nine  nearest  coarse-grid  values. 
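
The pre-injection averaging of Eq. 4.32 amounts to a 1-2-1 weighted average in the x-direction. The sketch below is a minimal Python illustration, assuming a uniform grid and an array `corr[I][J]` that already holds the corrections after the y-direction interpolation. The function name and the choice to leave the two end faces unchanged are assumptions made for this example and are not taken from the thesis.

```python
def pre_injection_average(corr, J):
    """Apply the 1-2-1 weighted x-direction average along row J.

    corr[I][J] holds coarse-grid-indexed corrections positioned at the
    fine-grid u locations. End faces are left unchanged (an assumption).
    """
    n = len(corr)
    averaged = [corr[I][J] for I in range(n)]
    for I in range(1, n - 1):
        averaged[I] = (corr[I + 1][J] + 2.0 * corr[I][J] + corr[I - 1][J]) / 4.0
    return averaged


oscillatory = [[0.0], [4.0], [0.0], [4.0]]
linear = [[0.0], [1.0], [2.0], [3.0]]
smoothed = pre_injection_average(oscillatory, 0)
unchanged = pre_injection_average(linear, 0)
```

The average damps a correction that oscillates from face to face while leaving a linearly varying correction untouched, which is the behavior one wants before injecting to the fine grid.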

Table  4.2  below  compares  the  various  intergrid  transfer  procedures  in  terms  of 
the  work  units  required  to  reach  a  prescribed  convergence  tolerance  on  the  finest  grid 
level.  The  notation  (p,r)  indicates  the  number  of  the  prolongation  and  restriction 
procedures adopted. The convergence tolerance on the fine grid is prescribed by
an estimate of the truncation error of the fine-grid discretization, which is derived
in Chapter 5. The criterion is typically not very stringent, so the table results best
reflect differences in the initial convergence rate instead of the asymptotic convergence
rate. 


                 Number of work units to converge

           Re = 1000 Cavity        Re = 400 Back-Step
(p,r)      V(2,1)     V(3,2)       V(2,1)     V(3,2)

(1,1)       19.0       23.6        123.2       95.7
(2,1)       21.8       28.5        110.0      166.6
(1,2)       16.9       24.4        168.9      181.7
(2,2)       20.2       20.5        263.5      122.4
(1,3)       12.7       13.6         div        51.8
(2,3)       14.1       13.8        239.5       59.6

Table 4.2. The effect of different restriction and prolongation procedures on the
convergence rate of the pressure-correction multigrid algorithm, for a 7-level cavity flow
problem with a 322 x 322 fine grid, and for a 5-level symmetric backward-facing step
flow with a 322 x 82 fine grid. The defect-correction approach is used. The entry "div"
denotes a computation that diverged.


Numerical  experiments  with  the  number  of  pre-  and  post-smoothing  iterations 
have  shown  that  for  the  cavity  flow,  V(2,l)  cycles  provide  enough  smoothing.  V(3,2) 
cycles  are  needed  for  the  symmetric  backward-facing  step  flow  computation.  With 
less  smoothing  the  number  of  work  units  to  reach  convergence  generally  increases 
even  though  the  number  of  work  units  per  cycle  is  smaller. 

The  restriction  procedure  used  appears  to  be  very  important  to  the  convergence 
rate  in  either  flow  problem.  The  restriction  procedure  3  appears  to  perform  better 
than  1  or  2.  The  discussion  presented  earlier  suggested  this  result.  However,  since  the 
residuals  are  summed  instead  of  averaged  they  are  typically  larger,  with  more  spatial 
variation  also.  As  a  result,  more  smoothing  iterations  are  needed  to  ensure  stability 
of  the  multigrid  iterations.  For  r  =  3,  it  appears  that  the  bilinear  interpolation 
procedure (p = 1) converges slightly faster than the biquadratic procedure.

The performance of the other restriction procedures appears to depend on the
prolongation procedure. In both problems the best results for r = 1 or r = 2 are
obtained when the corresponding (p = 1 or p = 2) prolongation procedure is used.
In the backward-facing step flow, the results for cell-face averaging (r = 1) are better
than the six-point averaging by a significant amount. The same is true for the cavity
flow but to a lesser degree. The effect of Reynolds number for each flow problem
should  be  considered  in  future  work. 

Figures 4.9 and 4.10 give a different look at the relative performance of restriction
procedures 1 and 3, that is, cell-face averaging of solutions and residuals contrasted
with summation of residuals only. The focus is on the asymptotic convergence rate
as opposed to the initial convergence rate considered in Table 4.2. The u-momentum
equation average residual (the L1 norm divided by the number of grid points) is
plotted  on  each  grid  level  against  work  units.  V(3,2)  cycles  and  bilinear  interpolation 
(p  =  1)  were  used  for  the  symmetric  backward-facing  step  flow  calculation. 

The  computations  have  been  carried  far  beyond  the  point  at  which  convergence 
was  declared  in  Table  4.2.  The  dashed  line  shows  the  estimated  truncation  error  on 
the  fine  grid  used  to  declare  convergence  for  the  table.  Brandt  and  Yavneh  have 
argued  that  this  level  of  convergence  should  be  sufficient  [9].  Further  multigrid  cycles 
reduce  the  algebraic  error  but  not  necessarily  the  differential  error. 

With  restriction  procedure  1,  Figure  4.9,  the  initial  multigrid  convergence  rate  is 
rapid,  but  levels  off  significantly  after  about  100  work  units.  This  apparently  slow 
asymptotic  multigrid  convergence  rate  is  still  much  better  than  the  single-grid  conver- 
gence rate  for  this  flow  problem,  indicating  that  there  is  some  benefit  being  obtained 
from  the  coarse-grid  corrections  with  the  restriction  procedure  1.  The  corrections  are 
evidently  not  as  large  as  with  restriction  procedure  3  (Figure  4.10),  because  this  case 
shows no reduction in the initial rapid convergence rate. It has been verified that
the convergence rate is maintained until the level of double-precision roundoff error
(-15.0)  is  reached,  although  the  convergence  path  is  shown  only  down  to  -8.0.  These 
figures  support  the  earlier  observation  that  the  restriction  procedure  3  is  appropriate 
to  the  finite-volume  discretization.  The  difference  between  the  performance  of  the 
restriction  procedures  1  and  3  is  even  more  dramatic  in  the  lid-driven  cavity  flow, 
Figures  4.11  and  4.12. 

The  convergence  rate  of  the  present  multigrid  method  appears  to  be  comparable 
to  other  results  in  the  literature.  Sockol  [80]  found  that  roughly  30  work  units  were 
needed  to  obtain  convergence  for  the  lid-driven  cavity  flow  at  Re  =  1000,  for  both 
BGS  and  SIMPLE.  The  residuals  were  summed  as  in  restriction  procedure  3,  but  the 
variables were also restricted, by cell-face averaging. W(1,1) cycles were used. Shyy
and  Sun  [74]  needed  many  more  work  units  to  reach  convergence,  using  V  cycles  at 
the  same  Reynolds  number  but  with  less  resolution  on  the  fine  grid  (81  x  81).  The 
restriction  procedure  1  was  used.  The  convergence  criterion  was  tighter,  and  there 
were  procedural  differences  from  the  present  work  and  that  of  Sockol  which  may  also 
account  for  the  differences. 

4.5     Concluding  Remarks 

Multigrid  techniques  are  potentially  scalable  parallel  computational  methods, 
both  in  the  numerical  sense  and  the  computational  sense.  The  key  issue  for  applying 
multigrid  techniques  to  the  incompressible  Navier-Stokes  equations  is  the  connection 
between  the  evolving  solutions  on  the  various  grid  levels,  which  includes  the  transfer 
of  information  between  coarse  and  fine  grids,  i.e.  the  restriction  and  prolongation 
procedures,  and  the  formulation  of  the  coarse-grid  problem,  i.e.  the  choice  of  the 
coarse-grid  convection  scheme.  These  factors  also  influence  the  stability  of  multigrid 
iterations.

The restriction procedure for finite-volume discretizations should be summing of
residuals. Also, it was found unnecessary to restrict the solution variables. The
convergence rate in both types of flow problems, shear-driven and pressure-driven, was
significantly accelerated when the residuals were summed instead of averaged. However,
because the residuals are larger, more smoothing is found to be necessary to
avoid stability problems in the symmetric backward-facing step flow. The bilinear
prolongation procedure appears to be preferable to the biquadratic prolongation
procedure. The convergence rates which have been achieved in the model problems
are  comparable  to  other  results  in  the  literature. 

In  terms  of  cost  per  iteration,  it  appears  that  the  pressure-correction  type  smoother 
is  comparable  to  the  locally-coupled  explicit  method  on  the  CM-5,  whereas  for  serial 
computations  the  latter  has  been  favored  by  some  [80].  Both  algorithms  consist  of 
basically  the  same  operations,  with  roughly  twice  as  much  influence  on  the  parallel 
run  time  from  the  coefficient  computations,  for  BRB.  The  coefficient  computation 
cost  is  comparable  to  the  smoothing  cost  for  the  SIMPLE  method,  but  for  BRB  the 
former  is  the  dominant  consideration.  In  that  respect,  the  uniform  implementation 
for  boundary  coefficient  computations  described  in  Chapter  3  and  the  choice  of  con- 
vection scheme  are  very  important  considerations.  Using  the  second-order  upwind 
scheme,  the  cost  per  iteration  of  SIMPLE,  assuming  3,  3,  and  9  point-Jacobi  in- 
ner iterations,  is  roughly  twice  as  much  compared  to  the  defect-correction  scheme, 
although there is negligible effect on the parallel efficiency.

[Figure 4.1 schematic: V(3,2) multigrid cycle over four grid levels, from Level 4
(fine grid) to Level 1 (coarse grid), with labels VP = 8k, VP = 2k, VP = 512, and
VP = 128; "(3)" denotes 3 smoothing iterations.]

Figure 4.1. Schematic of a V(3,2) multigrid cycle, which has three smoothing iterations
on the "downstroke" of the V and two smoothing iterations on the "upstroke."

[Figure 4.2 plot: "Smoother Comparison".]

Figure 4.2. Comparison of the total parallel run time for SIMPLE and BRB on a 128
vector-unit CM-5 for 500 iterations over a range of problem sizes. The flow problem
which was timed was Re = 1000 lid-driven cavity flow.

[Figure 4.3 plot: "Smoother Comparison".]

Figure 4.3. Comparison of the parallel run times for SIMPLE and BRB, decomposed
into contributions from the coefficient computations and the solution steps in these
algorithms. The times were obtained on a 128 vector-unit CM-5 for 500 iterations over
a range of problem sizes. The convection terms are central-differenced.

Figure 4.4. Comparison of the parallel run time for SIMPLE and BRB, decomposed
into contributions from parallel computation and nearest-neighbor interprocessor
communication ("NEWS"). The timings were made on a 128 vector-unit CM-5
for 500 iterations over a range of problem sizes.

[Figure 4.5 plot: "Single-Grid Convergence Paths for Re = 3200 Case".]

Figure 4.5. Decrease in the norm of the u-momentum equation residual as a function
of the number of SIMPLE iterations, for different convection schemes. The results
are for a single-grid simulation of Re = 3200 lid-driven cavity flow on an 81 x 81
grid. The alternating line-Jacobi method is used for the inner iterations. The results
do not change significantly with the point-Jacobi or the SLUR solver.

[Figure 4.6 plot: "Cost of Coefficient Computations".]

Figure 4.6. Comparison between two convection schemes, in terms of parallel run
time. The total (computation + communication) time spent computing coefficients
over 500 SIMPLE iterations, on a 128-VU CM-5, is plotted against the virtual
processor ratio, VP. "Solver time" is the time spent on 15 point-Jacobi inner iterations
per SIMPLE iteration, 3, 3, and 9 for the u, v, and p' systems of equations. It is just
coincidental that, for the defect-correction and central-difference cases, the coefficient
computations and the solver time are about equal.

[Figure 4.7 plot: "NEWS & CPU Costs in Coefficient Computations".]

Figure 4.7. For the second-order upwind and defect-correction schemes, the time
spent in coefficient computations for 500 SIMPLE iterations is decomposed into
contributions from computation, denoted "CPU", and from nearest-neighbor
interprocessor communication, denoted "NEWS". These quantities are plotted against the
virtual processor ratio, VP. Times are for a 128-VU CM-5.

[Figure 4.8 plot: "CM-5 SIMPLE Code: E vs. VP for 128 VUs"; x-axis VP from 0 to
10000; legend: x = 2nd-order upwind scheme, o = defect-correction scheme;
solver = point-Jacobi iterations (3, 3, and 9).]

Figure 4.8. Parallel efficiency, E, for a range of problem sizes. $E = T_1 / (n_p T_p)$, where
$T_1$ is the serial execution time, estimated by multiplying the measured computation
time per processor by the number of processors, $n_p$. $T_p$ is the elapsed CM-5
run time, including computation, interprocessor and front-end-to-processor types of
communication.

Figure 4.9. Convergence path on each grid level for a 5-level V(3,2) multigrid cycle.
The fine grid is 322 x 82. The flow problem is a Re = 400 symmetric backward-facing
step flow. Bilinear interpolation (p = 1) and cell-face averaging for restriction (r = 1)
are used.

Figure 4.10. Convergence path on each grid level for a 5-level V(3,2) multigrid cycle.
The fine grid is 322 x 82. The flow problem is a Re = 400 symmetric backward-facing
step flow. Bilinear interpolation (p = 1) and summation of residuals for restriction
(r = 3) are used.

Figure 4.11. Convergence path on each grid level for a 7-level V(2,1) multigrid cycle.
The fine grid is 322 x 322. The flow problem is Re = 1000 lid-driven cavity flow.
Bilinear interpolation (p = 1) and cell-face averaging for restriction (r = 1) are used.

Figure 4.12. Convergence path on each grid level for a 7-level V(2,1) multigrid cycle.
The fine grid is 322 x 322. The flow problem is Re = 1000 lid-driven cavity flow.
Bilinear interpolation (p = 1) and summation of residuals for restriction (r = 3) are
used.


CHAPTER  5 
IMPLEMENTATION  AND  PERFORMANCE  ON  THE  CM-5 

This  chapter  describes  the  implementation  on  the  CM-5  of  the  multigrid  method 
studied  previously,  and  applies  the  parallel  code  to  two  model  flow  problems  to  assess 
the  performance  both  in  terms  of  the  convergence  rate  and  the  cost  per  iteration.  The 
major  implementational  consideration  for  the  CM-5  multigrid  algorithm  is  the  storage 
problem. 

The  starting  procedure  by  which  an  initial  guess  is  generated  for  the  fine  grid  is  an 
important  practical  technique  whose  cost  on  parallel  computers  is  of  interest.  Also, 
the  starting  procedure  is  important  in  the  sense  that  the  initial  guess  can  affect  the 
stability  of  the  subsequent  multigrid  iterations  and  the  convergence  rate.  The  cycling 
strategy  is  discussed  next.  It  also  affects  both  the  run  time  and  the  convergence  rate. 
Because of the nonnegligible smoothing cost of coarse grids, the comparison between V
and W cycles in terms of the time per cycle is different from that on serial computers and
needs  to  be  assessed  for  the  CM-5.  The  purpose  of  the  chapter  is  to  provide  some 
practical  guidance  regarding  the  use  of  the  numerical  method  on  the  CM-5,  now  that 
the  choice  for  the  smoother,  the  coarse-grid  discretization,  and  the  restriction  and 
prolongation  procedures  has  been  addressed. 

Finally,  the  computational  scalability  of  the  parallel  implementation  is  studied 
using  timings  for  a  range  of  problem  sizes  and  numbers  of  processors.  With  the 
experience gained with regard to the choice of algorithm components and practical
techniques, this information gives a clear picture of the potential of the present
approach for scaled-speedup performance on massively-parallel SIMD machines.

5.1     Storage Problem

Multigrid  algorithms  pose  implementational  problems  in  Fortran,  because  the 
language  does  not  support  recursion.  A  variable  number  of  multigrid  levels  must  be 
accommodated but care must be taken not to waste memory. Let NI(k) and NJ(k)
be arrays denoting the grid extents on the kth multigrid level, where k = 1 refers
to the coarsest grid and k = k_max is the finest grid. The dimension extents on the
fine grid are parameters of the problem. For an array A, the different grid levels
are made explicit by adding a third array dimension. This is a natural albeit naive
storage declaration,

      PARAMETER (NI(kmax) = 1024, NJ(kmax) = 1024, kmax = 7)
      REAL*8 A(NI(kmax), NJ(kmax), kmax)

Unfortunately, this approach wastes storage because every grid level is dimensioned
to the extents of the finest grid. The coarse grids are significantly smaller,
though, decreasing in size by a factor of 4 for each level beneath the top level (the fine
grid). The total amount of memory used in this approach is the number of arrays,
$n_{array}$, multiplied by the storage cost of each array,

$$\text{Storage} = NI(k_{max})\,NJ(k_{max})\,k_{max}\,n_{array}. \tag{5.1}$$

The actual storage needed is only

$$\text{Storage} = \sum_{k=1}^{k_{max}} NI(k)\,NJ(k)\,n_{array} = \sum_{k=1}^{k_{max}} \frac{NI(k_{max})\,NJ(k_{max})\,n_{array}}{4^{\,k_{max}-k}}. \tag{5.2}$$

The actual storage needed approaches $(4/3)\,NI(k_{max})\,NJ(k_{max})\,n_{array}$ as $k_{max}$
increases. Thus the wasted storage is approximately
$(k_{max} - 4/3)\,NI(k_{max})\,NJ(k_{max})\,n_{array}$ when the
naive approach is used. Clearly this can become the dominating factor very quickly
as  the  number  of  levels  increases. 

One efficient solution for serial computation is to declare a 1-d array of sufficient
size to hold all the data on all levels and to reshape it across subroutine boundaries,
taking advantage of the fact that Fortran passes arrays by reference. This practice
is typical in serial multigrid algorithms [63]. A 1-d array section of the appropriate
length for the grid level under consideration is passed to a subroutine where it is
received as a 2-d array with the dimension extents NI(k) x NJ(k).
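
The serial idea can be illustrated outside Fortran. The sketch below is only an
analogy, assuming Python's standard `array` and `memoryview` types stand in for
a Fortran work array and its reshaped dummy argument; the helper names are
hypothetical. Note that the 2-d view here is row-major, whereas Fortran arrays
are column-major.

```python
from array import array


def allocate_levels(extents):
    """One flat double-precision buffer for all levels, plus start offsets."""
    offsets, total = [], 0
    for ni, nj in extents:
        offsets.append(total)
        total += ni * nj
    return array("d", [0.0] * total), offsets


def level_view(buf, extents, offsets, k):
    """2-d view (no copy) of level k, where k is a 0-based index into extents."""
    ni, nj = extents[k]
    flat = memoryview(buf)[offsets[k]:offsets[k] + ni * nj]
    # A 1-d view must pass through a byte format before taking a 2-d shape.
    return flat.cast("B").cast("d", shape=[ni, nj])


extents = [(5, 5), (10, 10), (20, 20)]   # coarsest first
buf, offsets = allocate_levels(extents)
grid = level_view(buf, extents, offsets, 1)
grid[2, 3] = 7.0                          # write through the 2-d view
print(buf[offsets[1] + 2 * 10 + 3])       # read back from the flat buffer
```

Because the view shares memory with the flat buffer, a routine that receives
`grid` works on the level's data in place, which is the property the serial
multigrid codes rely on.
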

On serial computers, this reshaping of arrays across subroutine boundaries is
possible because the physical layout of the array is linear in the computer's memory.
On distributed-memory parallel computers like the CM-5, however, the storage problem
is not so easily resolved because the data arrays do not reside in a single
processor's memory; they are distributed among the processors. Instead of being passed
by reference, as is the case with Fortran on serial computers, data-parallel arrays are
passed to subroutines by "descriptor" on the CM-5. The array descriptor is a
front-end array containing 18 elements. It contains information about the
array being described: the layout of the physical processor mesh, the virtual subgrid
dimensions, the rank and type of the array, the name and so on.

On the CM-5 the storage problem is resolved using array "aliases." Array aliasing
is a form of the Fortran EQUIVALENCE statement used on serial computers. In the
multigrid algorithm, storage for each variable is initially declared for all grid levels,
explicitly referencing the physical layout of the processors. For example, an array A
with fine-grid dimension extents NI(kmax) x NJ(kmax) is declared as follows for a
128-VU CM-5 with the processors arranged in an ($n_i = 8$) x ($n_j = 16$) mesh:

PARAMETER (Nserial = (4/3) NI(kmax) NJ(kmax) / (n_i n_j), n_i = 8, n_j = 16)
REAL*8 A( Nserial, n_i, n_j )


Actually, the factor 4/3 needs to be increased slightly to account for "array
padding." Each physical processor must be assigned exactly the same number of
virtual processors in the SIMD model, since all processors do the same thing at the
same time. Thus, in general the array dimensions on each level must be "padded"
to fit exactly onto the processor mesh. For example, an 80 x 80 fine grid with 5
multigrid levels has coarse grids with dimensions 40 x 40, 20 x 20, 10 x 10 and 5 x 5.
To fit onto the processor mesh with exactly the same subgrid shape and size for each
physical processor, assuming an 8 x 16 processor mesh, the storage allocated must
be 88 x 96 + 48 x 48 + 24 x 32 + 16 x 16 + 8 x 16 (on the coarsest grid VP = 1).
Thus the actual declared storage needs to be slightly more than that shown above.

The  array  A  is  mapped  to  the  processors  using  the  compiler  directives  discussed  in 
Chapter  3.  The  first  dimension  extent  of  A  is  the  actual  storage  needed  per  physical 
processor.  It  is  laid  out  linearly  in  each  physical  processor's  memory  by  the  :SERIAL 
specification  in  the  LAYOUT  compiler  directive  (recall  Chapter  3  example).  The 
latter  two  dimensions  are  parallel  (:NEWS),  laid  out  across  the  physical  processor 
mesh. 

Then, to access the A arrays corresponding to each grid level, array aliases
(alternate front-end array descriptors for the same physical data) are created as
described in Chapter 3. For example, an equivalence is established between the
"array section" A(1:88*96/(8*16), 1:8, 1:16) and another array with dimensions
(1:88, 1:96). In this way arrays can be referenced inside subroutines as if they had
the dimensions of the alias, with both dimensions parallel. In this case a
(:NEWS,:NEWS) layout of A(88,96) can be declared, even though in the calling routine
the data come from an array of a different shape.
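
The bookkeeping behind such aliases can be sketched as follows. This is an
illustration only: the padded extents and the 8 x 16 mesh are taken from the
80 x 80 example above, but the function name and the finest-first ordering of
the levels along the serial dimension are assumptions.

```python
N_PROC_I, N_PROC_J = 8, 16                       # processor mesh from the text
PADDED = [(88, 96), (48, 48), (24, 32), (16, 16), (8, 16)]  # finest level first


def serial_sections(padded, npi, npj):
    """Per-processor, 1-based (first, last) index range of each level
    along the serial dimension of A. Level ordering is an assumption."""
    sections, start = [], 1
    for ni, nj in padded:
        words, remainder = divmod(ni * nj, npi * npj)
        if remainder:
            raise ValueError("padded level does not tile the processor mesh")
        sections.append((start, start + words - 1))
        start += words
    return sections


sections = serial_sections(PADDED, N_PROC_I, N_PROC_J)
for extent, section in zip(PADDED, sections):
    print(extent, "-> A(%d:%d, 1:8, 1:16)" % section)

total_padded = sum(ni * nj for ni, nj in PADDED)
print("padded total:", total_padded, "vs (4/3)*80*80 =", 4 * 80 * 80 / 3)
```

The first section reproduces the A(1:88*96/(8*16), 1:8, 1:16) section quoted in
the text, and the padded total exceeds the unpadded 4/3 estimate, as stated.
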

This feature, array aliasing, is relatively new in the CM-Fortran compiler evolution
(version 2.1-Beta [84]) and has not yet been implemented by MasPar in their compiler.
Previous multigrid algorithms on SIMD computers were restricted to either the naive
approach or explicit declaration of arrays on each level [18]. The latter approach is
extremely tedious and leads to very large front-end executable codes, making
front-end storage a concern. Thus, the present technique for getting around the multigrid
storage problem, although requiring some programming diligence, is critical because
it permits much larger multigrid computations to be attempted on SIMD-type parallel
computers. As observed in Chapter 3, for the CM-5, problem sizes of the order of
the largest possible problem sizes are necessary to obtain good parallel efficiencies.

5.2     Multigrid  Convergence  Rate  and  Stability 

The "full multigrid" (FMG) startup procedure [11] is shown in Figure 5.1. It
begins with an initial guess on the coarsest grid. Smoothing iterations using the
pressure-correction method are done until a converged solution has been obtained.
Then this coarsest-grid solution is prolongated to the next grid level and multigrid
cycles are initiated (at level 2, the "next-to-coarsest" grid level). Cycling at this level
continues until some convergence criterion is met. The solution is prolongated to the
next finer grid and multigrid cycling resumes. This process is repeated until the finest
grid level is reached. The converged solution on level $k_{max} - 1$, after interpolation to
the fine grid, is a much better initial guess than is possible otherwise. The alternative
is to use an arbitrary initial guess on the fine grid.
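
The control flow of the startup procedure can be sketched as below. This is a
schematic driver under stated assumptions, not the implementation used in this
work: the four callables are hypothetical placeholders for the coarsest-grid
solve, the prolongation, one multigrid cycle and the convergence test, and at
least one cycle is assumed to be taken on each intermediate level before the
test is applied.

```python
def fmg_startup(kmax, solve_coarsest, prolongate, v_cycle, converged,
                max_cycles=100):
    """Return (initial guess on level kmax, cycles taken on levels 2..kmax-1)."""
    solution = solve_coarsest()                # level 1: smooth until converged
    cycles = {}
    for k in range(2, kmax + 1):
        solution = prolongate(solution, k)     # interpolate level k-1 -> level k
        if k == kmax:
            break                              # fine-grid cycling is done by the caller
        cycles[k] = 0
        while cycles[k] < max_cycles:
            solution = v_cycle(solution, k)    # one multigrid cycle, outermost level k
            cycles[k] += 1
            if converged(solution, k):
                break
    return solution, cycles


# Toy stand-ins so the driver can be exercised; they carry no physics.
log = []


def toy_solve():
    log.append("solve 1")
    return 1.0


def toy_prolongate(s, k):
    log.append("prolongate to %d" % k)
    return s


def toy_cycle(s, k):
    log.append("cycle on %d" % k)
    return 0.5 * s


guess, cycles = fmg_startup(4, toy_solve, toy_prolongate, toy_cycle,
                            lambda s, k: True)
print(log)
print(guess, cycles)
```

Swapping the `converged` callable is all that is needed to compare different
coarse-grid stopping rules within the same driver.
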

For  Poisson  equations,  one  V  cycle  on  the  finest  grid  is  frequently  sufficient  to 
reach  a  converged  solution,  if  the  initial  guess  is  obtained  by  the  FMG  procedure. 
The  benefit  to  the  convergence  rate  of  a  good  initial  guess  more  than  offsets  the  cost 
of  the  V  cycles  on  coarse  grids  leading  up  to  the  finest  grid  level.  For  Navier-Stokes 
equations  the  cost/convergence  rate  tradeoff  still  favors  using  the  FMG  procedure, 
on  serial  computers.  For  parallel  computers,  however,  the  cost  of  the  FMG  procedure 
is  more  of  a  concern,  due  to  the  inefficiencies  of  smoothing  the  coarse  grids,  and  the 
potential  need  for  many  coarse-grid  cycles. 

On SIMD computers, the smoothing iterations on the coarse grid levels have a
fixed baseline time set by the communication overhead of the front-end-to-processor
type. Thus, the cost of the FMG procedure is increased compared to serial
computation because coarse-grid smoothing is relatively more costly (less efficient) than
fine-grid smoothing. It becomes important, with regard to cost, to minimize the
number of coarse-grid cycles, without sacrificing the benefit of a good initial guess to
the multigrid convergence rate.

Tuminaro  and  Womble  [88]  have  recently  modelled  the  parallel  run  time  of  the 
FMG  cycle  on  a  distributed  memory  MIMD  computer,  a  1024-node  nCUBE2.  They 
developed  a  grid-switching  criterion  to  account  for  the  inefficiencies  of  smoothing  on 
coarse  grids.  The  grid-switching  criterion  effectively  reduces  the  number  of  coarse- 
grid  cycles  taken  during  the  FMG  procedure.  They  have  not  yet  reported  numerical 
tests  of  their  model,  but  the  theoretical  results  indicate  that  the  cost/convergence 
rate  tradeoff  can  still  favor  FMG  cycles  for  multigrid  methods  on  parallel  computers, 
with  their  technique.  In  the  next  section  a  truncation  error  estimate  is  developed 
and  then  used  to  control  the  amount  of  coarse-grid  cycling  in  the  FMG  procedure. 
The  validity  and  the  numerical  characteristics  of  the  truncation  error  estimate  are 
addressed. 

In  addition  to  the  cost  of  obtaining  the  initial  guess  on  the  fine  grid,  the  quality 
of  the  initial  guess  can  affect  both  the  convergence  rate  and  the  stability  of  the 
subsequent  multigrid  iterations,  depending  on  the  flow  problem  and  the  coarse-grid 
convection  scheme,  i.e.  the  stabilization  strategy.  The  performance  of  the  truncation 
error  criterion  in  this  regard  is  also  studied. 

5.2.1     Truncation  Error  Convergence  Criterion  for  Coarse  Grids

The goal of a given discretization and numerical method is to obtain an
approximate solution to Eq. 4.6, $v^h$, which nearly satisfies the differential equation,
i.e. to achieve

\[ \| A^h u - A^h v^h \| < \epsilon, \tag{5.3} \]

for some small $\epsilon$. However, $u$ is unknown and there are many complicating, interacting
factors due to the grid distribution, resolution, the discretization of the nonlinear
terms and the proper modelling and specification of boundary conditions. Thus the
conservative philosophy is usually adopted: assume that the discretized equation is
a good approximation to the differential equation and seek the exact solution to the
discrete equation, i.e. seek algebraic convergence,

\[ \| A^h u^h - A^h v^h \| = \| S^h - A^h v^h \| = \| r^h \| < \epsilon, \tag{5.4} \]

again choosing the level $\epsilon$ to accommodate any imposed constraints on the run time.
Eq. 5.4 is applied on the finest grid in a multigrid computation, the level on which
the solution is desired.

The coarse grid solution obtained in the FMG procedure has only one purpose:
to yield a good initial guess on the fine grid. The "best" initial guess is the one that
allows Eq. 5.4 to be satisfied on the fine grid most quickly. The corresponding coarse-grid
solution from which the fine-grid initial guess is obtained may or may not satisfy
Eq. 5.4 with $\epsilon = 0$ itself. It is not always beneficial to the fine-grid convergence rate
to obtain the coarse-grid solution to strict tolerances.

The utility of a coarse-grid solution for the purpose of providing a good initial
guess on the fine grid depends more on the difference in the truncation errors of the
$\Omega^{2h}$ and $\Omega^h$ approximations than it does on the accuracy of the coarse-grid
solution. For example, in highly nonlinear equations or in problems where grid levels are
coarsened by factors greater than two, it is immediately apparent that the solution
accuracy in the coarse-grid solution to the discrete problem cannot translate into
a truly accurate initial guess on the fine grid no matter how accurately the coarse
grid problem is solved. The usefulness of the coarse-grid solution depends on the
smoothness of the physical solution and the prolongation procedure.

Consequently,  one  expects  that  the  most  cost-effective  procedure  for  controlling 
the  FMG  cycling  will  be  obtained  with  a  particular  set  of  coarse-grid  tolerances 
that  depend  on  the  flow  characteristics.  Thus  the  goal  should  be  to  discontinue  the 
FMG  cycles  on  a  particular  coarse  grid  level  when  Eq.  5.3  is  satisfied.  Frequently 
Eq.  5.3  is  satisfied  before  Eq.  5.4.  Similar  arguments  have  been  made  by  Brandt  and 
Ta'asan  [7]. 

Using the definitions of the truncation error, Eq. 4.2, and the residual, the triangle
inequality gives

\[ \| A^{2h} u - A^{2h} v^{2h} \| \le \| A^{2h} u - A^{2h} u^{2h} \| + \| A^{2h} u^{2h} - A^{2h} v^{2h} \|
   = \| \tau^{2h} \| + \| r^{2h} \|. \tag{5.5} \]

Thus, if

\[ \| \tau^{2h} \| = \epsilon / 2, \tag{5.6} \]

Eq. 5.3 can be satisfied if the residual is less than the truncation error,

\[ \| r^{2h} \| < \| \tau^{2h} \|. \tag{5.7} \]

Eq. 5.7 is the criterion applied to the coarse grids, while Eq. 5.4 is retained for the
finest level.

To develop an estimate for the truncation error norm in Eq. 5.7, consider an example
case of a 1-D nonlinear convection-diffusion equation with a constant or
position-dependent source term,

\[ u \frac{\partial u}{\partial x} - \frac{\partial^2 u}{\partial x^2} = S. \tag{5.8} \]

For a finite-difference discretization with central differencing for both derivative
terms, the truncation error at grid point $i$ on the grid with spacing $h$ is given by

\[ u_i \frac{u_{i+1} - u_{i-1}}{2h} - \frac{u_{i+1} - 2u_i + u_{i-1}}{h^2} - S_i
   = \tau_i^h = \frac{u h^2}{6} u_{xxx} - \frac{2 h^2}{24} u_{xxxx} + \ldots \tag{5.9} \]

where $u$ is the differential solution at the position $x = ih$.
Similarly, on the grid with spacing $2h$,

\[ u_I \frac{u_{I+1} - u_{I-1}}{4h} - \frac{u_{I+1} - 2u_I + u_{I-1}}{4h^2} - S_I
   = \tau_I^{2h} = \frac{4 u h^2}{6} u_{xxx} - \frac{8 h^2}{24} u_{xxxx} + \ldots \tag{5.10} \]

The grid points $x_I$ and $x_i$ correspond, but $I+1$ refers to the point at
$x = x_I + 2h$, whereas $i+1$ refers to the point at $x = x_i + h$. Assuming the higher-order
terms are negligible (debatable for fluid flow problems unless the solution is very
smooth), and subtracting the first equation from the second (at the grid points of
$\Omega^{2h}$), one obtains

\[ \left[ u_I \frac{u_{I+1} - u_{I-1}}{4h} - \frac{u_{I+1} - 2u_I + u_{I-1}}{4h^2} - S_I \right]
   - \left[ u_i \frac{u_{i+1} - u_{i-1}}{2h} - \frac{u_{i+1} - 2u_i + u_{i-1}}{h^2} - S_i \right]
   = 3\tau^h. \tag{5.11} \]

In operator notation,

\[ A^{2h} u - S^{2h} - \left[ A^h u - S^h \right] = 3\tau^h. \tag{5.12} \]

Substituting the most current approximation $v^h$ for $u$ (at the coarse-grid grid points),
and the approximate values $v^{2h} = I_h^{2h} v^h$, this expression becomes

\[ A^{2h}(I_h^{2h} v^h) - S^{2h} - \left[ A^h v^h - S^h \right] \approx 3\tau^h. \tag{5.13} \]

The term in brackets is just the residual $r^h$ (at the corresponding coarse-grid grid
point). For finite-difference discretizations this residual is presumed to be accurately
approximated by $I_h^{2h} r^h$. Thus the truncation error of the fine-grid discretization,
estimated at the coarse-grid grid points, is

\[ \tau^h \approx \frac{A^{2h}(I_h^{2h} v^h) - S^{2h} - I_h^{2h} r^h}{3}. \tag{5.14} \]

This expression, however, is merely the numerically derived part of the coarse-grid
source term, $S^{2h}_{numerical}$ in Eq. 4.14. Thus

\[ \tau^h \approx \frac{S^{2h}_{numerical}}{3}. \tag{5.15} \]

The convergence criterion based on this truncation error estimate, Eq. 5.7, becomes

\[ \| r^h \| < \frac{\| S^{2h}_{numerical} \|}{3}. \tag{5.16} \]

The norms used on each side of the equation should be divided by the appropriate
number of grid points (since they are defined on different grid levels), so that the
quantities represented are comparable. The $L_1$ norm is used here: on a grid with
$N^h$ points, the $L_1$ norm of a vector $v$ is

\[ \| v \| = \frac{1}{N^h} \sum_{\text{all } i,j} | v_{i,j} |. \tag{5.17} \]

Eq. 5.16 is very convenient. It is a way of setting the coarse-grid tolerances in the
FMG procedure automatically. Also, since the additional coarse-grid term $S^{2h}_{numerical}$
is already computed as part of the coefficient computations preceding the coarse-grid
smoothing, there are no new quantities to be computed and monitored.
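
A minimal sketch of how such a test might be coded is given below. It assumes
the reading of Eq. 5.16 in which the point-averaged residual norm on the current
level is compared with the point-averaged norm of the numerically derived
coarse-grid source term divided by a denominator (3 in the analysis; 1 and 5 are
also tried in the next section). The function names and sample values are
hypothetical.

```python
def mean_abs(values):
    """Eq. 5.17-style norm: sum of |v| divided by the number of grid points."""
    values = list(values)
    return sum(abs(v) for v in values) / len(values)


def coarse_level_converged(residual, s_numerical_coarse, denominator=3.0):
    """True when the averaged residual falls below the truncation error estimate.

    residual            -- residual values on the current outermost level
    s_numerical_coarse  -- numerically derived source term on the next coarser level
    """
    return mean_abs(residual) < mean_abs(s_numerical_coarse) / denominator


# Made-up sample data: a 4 x 4 level and its 2 x 2 coarse level, flattened.
residual = [1.0e-3] * 16
s_numerical = [4.0e-3] * 4

for denom in (1.0, 3.0, 5.0):
    print(denom, coarse_level_converged(residual, s_numerical, denom))
```

Dividing each norm by its own number of points is what makes quantities from two
different grid levels comparable; a larger denominator tightens the tolerance.
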

5.2.2     Numerical  Characteristics  of  the  FMG  Procedure 

The following issues are addressed: the validity/utility of the analysis above
leading to Eq. 5.16, the performance of the resulting FMG procedure based on the
truncation error convergence criterion in terms of the cost and the initial residual level
on the fine grid, and the characteristics of the convergence path through the FMG
cycling as a function of the flow problem and the coarse-grid convection scheme.

Two flow problems with very different physical characteristics are considered, the
lid-driven cavity flow at Reynolds number 5000 and a symmetric backward-facing step
flow at Reynolds number 300. Streamlines, velocity, vorticity and pressure contours
for the two model flow problems are shown in Figures 5.2 and 5.3, to clarify the
problem specification and bring out their different physical features. In the streamline
plots, the contours both inside and outside the recirculation regions are spaced evenly.
However, because the recirculation regions are fairly weak in both problems, the
spacing between contour levels is set to be smaller within the recirculation regions in
order to bring out the flow pattern.

The  lid-driven  cavity  flow  is  a  recirculating  flow  where  convection  and  cross- 
stream  diffusion  balance  each  other  in  most  of  the  domain  and  the  pressure  gradient  is 
important  only  in  the  upper-left  corner.  In  contrast,  the  symmetric  backward-facing 
step  flow  is  aligned  with  the  grid  for  much  of  the  domain.  The  pressure  gradient 
balances  viscous  diffusion  as  in  channel-type  flows.  These  problems  are  challenging 
in  different  ways  and  are  representative  of  much  broader  cross-sections  of  interesting 
flow  situations. 

Figures 5.4-5.7 show the convergence path of the u-momentum residual in the
lid-driven cavity flow for different coarse-grid convergence criteria. The residual is
plotted for the current outermost level, during the FMG procedure. Also the plot is
continued for the first three multigrid cycles on the finest grid level to show the initial
multigrid convergence rate on the fine grid. The finest grid level was 321 x 321 and
seven multigrid levels were used; the coarsest grid is 6 x 6. The defect-correction
approach was used, with first-order upwinding on coarse grids and defect-correction on the
finest level. V(3,2) cycles with bilinear interpolation for the prolongation procedure
and restriction procedure 3, piecewise-constant summation of the residuals only, were
used. The relaxation factors were $\omega_{uv} = 0.5$ and $\omega_p = 0.5$, and point-Jacobi inner
iterations were used, with $\nu_u = \nu_v = 3$ and $\nu_c = 9$. In the symmetric backward-facing
step results given below, the same procedures are used, except in the smoother the
relaxation factors are $\omega_{uv} = 0.6$ and $\omega_p = 0.4$. The fine grid is 321 x 81 and five
multigrid levels are used.

In Figure 5.4, the truncation error criterion Eq. 5.16 is applied, with the
denominator set to 1. This is the "right" denominator according to the analysis behind
Eq. 5.16, since the outermost levels during the FMG cycling on coarse grids are
first-order accurate in the convection term, provided convection is important in the flow
problem. The tolerances given by the truncation error criterion are graded, because
the truncation error is larger on coarser grids. The spacing between the levels is
uneven, though, and depends on the evolving solution. For the cavity flow, the
truncation error estimate in Eq. 5.16, with the denominator equal to 1, converged to
+0.2, -0.4, -1.2, -1.9 and -2.6 for levels 2 through 6. On the finest grid the truncation
error estimate converges to -3.0.

The figure shows a jump in the residual level going from coarse grid to fine grid
of approximately -0.6 between any two successive levels. This jump is just $\log_{10}(1/4)$.
Physically the equation residuals represent integrated quantities in the finite-volume
discretization. Thus, whether on the coarse or the fine grid, the net residual ($L_1$
norm) should be roughly the same (or greater, because the bilinear/biquadratic
interpolations considered here should not be expected to improve the solution since they
are not derived from the physics). In the norm used here the sum of the residuals
is divided by the number of grid points. Thus, in the best case one would
anticipate the result which has been obtained, with the factor of 4 decrease in the average
residual: the fine-grid control volumes are a factor of 4 smaller than the coarse-grid
control volumes. The fact that the maximum jump is achieved indicates that the
order of the prolongation procedure is sufficient for the flow problem. In Figure 5.8
the corresponding case for the symmetric backward-facing step flow is shown. The
jump in the average u residual between levels is about -0.4. Similar observations
hold for second-order upwinding in both flow problems, using the truncation error
criterion with the denominator set equal to three. Thus, the results obtained are
plausible and, one would expect, about the best that are possible.

Figure 5.5 shows the effect of applying a more stringent coarse-grid convergence
criterion. In this case the truncation error estimate is again used but with the
denominator set to five. A slight improvement in the initial level of the residual on the
finest grid is obtained. After 1 fine-grid cycle, the residual is -3.5 compared to -3.25
for the 1-FMG cycle. However, tightening the coarse-grid tolerances even further
does not give any benefit. For example, Figure 5.6 shows the FMG convergence path
when the coarse-grid residual is driven down to a specified value on each level, i.e.
when

\[ \| r^h \| < \epsilon \tag{5.18} \]

is applied, with $\epsilon = -3.0$ in Figure 5.6. Also, in the subsequent figure, Figure 5.7,
the FMG convergence path is shown for a "graded" set of tolerances. Specifically, for
the 7-level cavity flow, levels 2 through 6 were converged to -0.7, -1.3, -1.9, -2.5 and
-3.1, respectively (factor of 1/4 reduction per level). These particular values are all
equal to -2.4 if, instead of Eq. 5.17, the residual is normed according to

\[ \| r \| = \sum_{\text{all } i,j} \frac{| r_{i,j} |}{\text{flux}}, \tag{5.19} \]

where flux is a characteristic momentum flux, equal to the Reynolds number in the
present flow problem. Shyy and Sun [74] used this approach. The tolerance on level 6,
-3.1, was chosen a posteriori to match the known initial level of the fine-grid residual.
The graded set of coarse-grid tolerances is representative of a "best possible guess"
that one could make without prior experience.
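
The relation between the two norms can be checked with a few lines of
arithmetic. The sketch below assumes that the quoted tolerances are base-10
logarithms of the point-averaged norm, that the flux-scaled norm is the plain sum
of absolute residuals divided by the flux, and that the number of control volumes
per direction is 5 on level 1 and doubles on each finer level; these are
assumptions made for illustration, as are the function names.

```python
import math

FLUX = 5000.0   # characteristic momentum flux, equal to the Reynolds number here
GRADED = {2: -0.7, 3: -1.3, 4: -1.9, 5: -2.5, 6: -3.1}   # log10 of averaged norm


def control_volumes(level, coarsest_per_direction=5):
    """Assumed number of control volumes on a square grid at the given level."""
    n = coarsest_per_direction * 2 ** (level - 1)
    return n * n


def flux_normed(level, log_average, flux=FLUX):
    """Convert log10(sum|r| / N) into log10(sum|r| / flux)."""
    return log_average + math.log10(control_volumes(level)) - math.log10(flux)


converted = {level: flux_normed(level, tol) for level, tol in GRADED.items()}
for level, value in converted.items():
    print(level, round(value, 2))

print("log10(1/4) =", round(math.log10(0.25), 3))
```

Under these assumptions every graded tolerance maps to roughly the same
flux-scaled value, and the spacing of about 0.6 between successive tolerances is
the factor of 4 in the number of control volumes.
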

From these figures, there does not appear to be any benefit in converging the
coarse grids to tighter tolerances. Furthermore, there is the disadvantage that tighter
coarse-grid tolerances require more coarse-grid cycles and are therefore more
expensive in terms of work units and especially in terms of run time on the CM-5 (the
bottom plot). The graded tolerances work almost as well as the truncation error
criterion, except that there are a few unnecessary cycles on levels 2 and 3.

The  tradeoff  between  the  run  time  elapsed  during  the  FMG  procedure  on  serial 
and  parallel  computers,  and  the  initial  level  of  the  u  residual,  is  summarized  in 
Table  5.1. 


Coarse-grid          Number of V(3,2)      FMG          FMG CM-5     Initial level
tolerances           cycles on levels      work units   busy time    of fine-grid
                     {1...6}                                         u residual

T.E. w/denom. = 1    {x  1  1  1  1  1}     2.2          2.3 s       -3.25
T.E. w/denom. = 5    {x  2  3  6  6  5}    11.5         10.9 s       -3.54
-3.0 on all levels   {x 15 17 14 10  3}    11.1         20.2 s       -3.46
-5.0 on all levels   {x 24 27 26 30 32}    69.3         64.1 s       -3.57
Graded tolerances    {x  6  8  8  6  5}    11.9         13.0 s       -3.54

Table 5.1. Comparison between different sets of coarse-grid tolerances in terms of the
effort expended in the FMG procedure, for the Re = 5000 lid-driven cavity flow and
the bilinear interpolation prolongation procedure. The defect-correction stabilization
strategy is used.


To  judge  which  case  is  the  "best,"  one  asks  how  many  work  units  or  how  much 
cpu  time  is  required  to  reach  a  given  level  of  the  residual.  A  few  fine-grid  cycles  are 
required  to  make  up  the  difference  in  the  initial  levels  of  the  fine-grid  residual.  These 
are  charged  at  a  rate  of  slightly  more  than  6.25  work  units  per  V(3,2)  cycle  for  this 
7-level  problem  with  the  321  x  321  fine  grid,  equivalent  to  about  1.5  seconds  on  a 
128-VU  CM-5.  Thus,  the  "1-FMG"  procedure  (the  first  row)  is  judged  to  be  the 
most  efficient. 
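
The comparison can be made concrete by expressing the extra FMG effort of each
alternative in units of fine-grid cycles. The sketch below only rearranges the
numbers in Table 5.1 and the quoted per-cycle costs (6.25 work units, about 1.5 s
on the CM-5); the dictionary layout and function name are hypothetical, and no
assumption is made about how much one fine-grid cycle reduces the residual.

```python
# (FMG work units, CM-5 busy time in seconds, initial log10 fine-grid u residual)
TABLE_5_1 = {
    "T.E. denom = 1": (2.2, 2.3, -3.25),
    "T.E. denom = 5": (11.5, 10.9, -3.54),
    "-3.0 all levels": (11.1, 20.2, -3.46),
    "-5.0 all levels": (69.3, 64.1, -3.57),
    "graded": (11.9, 13.0, -3.54),
}
WORK_UNITS_PER_CYCLE = 6.25   # one fine-grid V(3,2) cycle, from the text
SECONDS_PER_CYCLE = 1.5       # same cycle on a 128-VU CM-5, from the text


def extra_effort(name, baseline="T.E. denom = 1"):
    """Extra FMG cost relative to the baseline, in fine-grid cycle equivalents,
    together with the gain in the initial residual level (in decades)."""
    wu, sec, res = TABLE_5_1[name]
    wu0, sec0, res0 = TABLE_5_1[baseline]
    return ((wu - wu0) / WORK_UNITS_PER_CYCLE,
            (sec - sec0) / SECONDS_PER_CYCLE,
            res0 - res)


for name in TABLE_5_1:
    serial_cycles, cm5_cycles, decades = extra_effort(name)
    print("%-16s %6.2f %6.2f %5.2f" % (name, serial_cycles, cm5_cycles, decades))
```

Read this way, each alternative spends the equivalent of one or more fine-grid
cycles (considerably more when measured in CM-5 time) to start the fine grid at
most about a third of a decade lower.
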

Evidently, the cavity flow problem is relatively benign in terms of the convection
effect on the convergence rate characteristics. The truncation error estimate is
immediately satisfied on each of the coarse grids in the 7-level computation after only
1 V(3,2) cycle. Even less smoothing is possible for this problem, even though the
Reynolds number is high. Table 5.2 clarifies the role of the FMG procedure in this
flow problem.


Number of V(3,2)     FMG          FMG CM-5     Initial level
cycles on levels     work units   busy time    of fine-grid
{1...6}                                        u residual

{0 0 0 0 0 0}        diverges
{x 0 0 0 0 0}        diverges
{x 1 0 0 0 0}        0.006        0.22 s       -2.44
{x 1 1 0 0 0}        0.031        0.50 s       -2.73
{x 1 1 1 0 0}        0.135        0.88 s       -3.03
{x 1 1 1 1 0}        0.550        1.42 s       -3.16
{x 1 1 1 1 1}        2.216        2.25 s       -3.25

Table 5.2. Accuracy/effort tradeoff between a "1-FMG" approach (7th row), and
simple V cycling with a zero initial guess on the fine grid (1st row). An approximate
solution must be obtained on at least level 2 in order to avoid divergence for the
7-level Re = 5000 lid-driven cavity flow problem, when the relaxation factors are
$\omega_{uv} = \omega_c = 0.5$. "FMG work units" refers to the work units (proportional to a serial
computer's run time) already expended at the point when multigrid cycling on the
finest grid level begins. "CM-5 busy time" is the corresponding measure of work on
a 128-VU CM-5, in seconds. The "x" in the column corresponding to level 1 means
that 2 SIMPLE iterations were done on the coarsest grid. These data are for the
defect-correction strategy.


Thus,  it  is  possible  to  prolong  the  solution  directly  from  level  3,  a  21  x  21  grid, 
to  the  fine  grid.  However,  for  the  relaxation  factors  used,  an  initial  guess  on  an  even 
coarser  grid  (level  1  or  2)  is  not  accurate  enough  to  prevent  the  fine-grid  V(3,2)  cycles 
from  diverging. 

The results in Figures 5.6-5.7 showed that the initial residual on the fine grid was
independent of the degree of accuracy obtained on the coarser grid levels. Closer
examination shows that the initial residual levels on the coarse grid levels during
the FMG procedure also do not appear to depend on the degree to which the next
coarser grid level is converged. Furthermore this observation holds for second-order
upwinding on all levels in the cavity flow, and for either defect-correction or
second-order upwinding in the symmetric backward-facing step flow. The FMG convergence
paths for the step flow, using second-order upwinding on all grid levels, are shown in
Figures 5.9-5.12.

There appears to be a certain maximum amount of accuracy that can be
carried over to the next finer grid with the bilinear interpolation prolongation. Since
the truncation error convergence criterion does not exceed this amount of accuracy,
and indeed the average residual levels are virtually the same if the denominator in
Eq. 5.16 is set to five, the results strongly suggest that the degree of accuracy on a
given coarse grid which is exploitable is related to the differential error in the
solution, i.e. the truncation error, and not the algebraic error. Thus, the results support
the arguments made in the paragraph following Eq. 5.4.

With regard to the performance of the truncation error criterion, the
defect-correction and second-order upwind stabilization strategies showed similar results, in
both flow problems. The initial fine-grid residual level and the stability of the
subsequent multigrid iterations, however, appear to be strongly dependent on the
convection schemes used. Table 5.3 summarizes the FMG convergence rates for second-order
upwinding in the lid-driven cavity flow.

The -3.0 and graded tolerance cases both converged with the defect-correction
scheme, but with second-order upwinding they diverge. After several fine-grid cycles
the -2.0 case diverges also. The difference between the cases is evident: many more
coarse-grid cycles are taken in the cases which diverge. The source terms in the
second-order upwind discretization appear to be a strong destabilizing factor in this
flow problem.

Coarse-grid          Number of V(3,2)      FMG          FMG CM-5     Initial level
tolerances           cycles on levels      work units   busy time    of fine-grid
                     {1...6}                                         u residual

T.E. w/denom. = 1    {x  1  2  2  2  5}     9.4          8.8 s       -2.88
T.E. w/denom. = 5    {x  4 14  7 16 18}    37.7         40.0 s       -3.50
-2.0 on all levels   {x 35 22 19  6  1}     6.9         28.6 s       -3.17
-3.0 on all levels   {x 45 34 74 35  ∞}    diverges
Graded tolerances    {x 23 14 19 20  ∞}    diverges

Table 5.3. Comparison between different sets of coarse-grid tolerances in terms of the
effort expended in the FMG procedure, for the Re = 5000 lid-driven cavity flow and the
bilinear interpolation prolongation procedure. The second-order upwind stabilization
strategy is used.

Furthermore,  the  fact  that  the  -2.0  constant  tolerance  at  least  reaches  the  fine 
grid  while  the  constant  -3.0  tolerance  diverges  suggests  that  the  amount  of  mismatch 
between  the  ending  coarse-grid  residual  level  and  the  beginning  fine-grid  residual 
level,  which  is  greater  for  the  -3.0  case  than  the  -2.0  case,  is  related  to  the  size  of  the 
destabilizing  source  terms  in  the  initial  fine-grid  problem.  Thus  in  addition  to  being 
wasteful  of  work  units  and/or  cpu  time,  obtaining  excessive  accuracy  on  the  coarse 
grids can actually be detrimental to the stability of multigrid iterations, depending on
the discretization scheme. Evidently, with relaxation factors $\omega_{uv} = \omega_c = 0.5$,
second-order upwinding, V(3,2) cycles, and $\nu_u = \nu_v = 3$ and $\nu_c = 9$ point-Jacobi
inner iterations in each SIMPLE outer iteration, the Re = 5000 lid-driven cavity
flow is difficult to solve. The multigrid iterations only converge for a relatively small
range  of  coarse-grid  tolerances.  This  range  may  be  hard  to  find  by  trial  and  error. 
The  truncation  error  criterion  is  useful  in  this  regard. 

Similar observations are made for the symmetric backward-facing step flow. Figures
5.9-5.12 are the corresponding results for the Re = 300 symmetric backward-facing
step flow, using second-order upwinding on all coarse grid levels in the FMG
procedure. The convergence rate behavior of the second-order upwind scheme in the
step flow is similar to the defect-correction scheme results in the lid-driven cavity
flow. For the symmetric backward-facing step flow, a 321 x 81 fine grid with 5 multigrid
levels was used. The coarsest grid was 21 x 6. As in the cavity flow cases,
V(3,2) cycles were used with bilinear interpolation for the prolongation procedure
and restriction procedure 3. The relaxation factors were $\omega_{uv} = 0.6$ and $\omega_c = 0.4$. As
in previous cases, 3, 3, and 9 point-Jacobi inner iterations were used in each SIMPLE
iteration for the u, v, and p' systems of equations, respectively.

In  Figure  5.9,  the  convergence  path  is  similar  to  the  cavity  flow  convergence 
path — except  that  in  the  cavity  flow,  the  coarse-grid  tolerances  given  by  Eq.  5.15 
were  loose  enough  that  only  one  cycle  was  needed  on  each  of  the  coarse  grids,  yielding 
a  "1-FMG"  cycle.  In  the  symmetric  backward-facing  step  flow,  more  than  one  cycle 
is  needed  on  each  coarse-grid  level  to  satisfy  the  truncation  error  criterion.  The 
truncation error estimate (with the denominator set equal to three because the coarse-grid
discretizations are second-order) gives the following levels for grid
levels 2 to 4: -0.8, -2.9, and -4.0. On the finest grid the estimated level is -4.9.

Figures  5.10-5.12  show  the  FMG  convergence  path  when  tighter  coarse-grid  toler- 
ances are  used,  and  these  results  are  summarized  in  Table  5.4  below.  For  the  graded 
set  of  coarse  grid  tolerances,  levels  2  through  4  were  converged  to  -3.1,  -3.7,  and  -4.3. 
Each of these levels corresponds to the level -2.1 if the norm used is Eq. 5.19 instead
of the average $L_1$ norm.

As  in  the  cavity  flow  case,  there  is  only  a  small  effect  on  the  initial  solution 
accuracy  on  each  coarse-grid  level.  There  is  no  benefit  to  the  initial  fine-grid  residual 
level  by  converging  the  coarse  grids  to  strict  tolerances.  The  truncation  error  criterion 
with the denominator set to 5 appears to be the most stringent criterion which does
not waste any coarse-grid cycles, i.e. it gives nearly the optimal balance between cost
and residual reduction. The other approaches obtain more accuracy on the coarse grids than can be
carried over to the initial fine-grid solution, for the bilinear interpolation prolongation.


Coarse-grid          Number of V(3,2) cycles   FMG          FMG CM-5    Initial level of
tolerances           on levels {1...4}         work units   busy time   fine-grid u residual

"1-FMG" cycle        {x   1   1   1}           2.2          1.2 s       -2.88
T.E. w/denom. = 1    {x   2   2   3}           5.9          3.0 s       -3.82
T.E. w/denom. = 5    {x   4   4   4}           8.6          4.8 s       -4.63
-3.0 on all levels   {x  23   7   1}           6.5          8.1 s       -4.37
-5.0 on all levels   {x  45  16  10}           27.0         21.5 s      -5.10
Graded tolerances    {x  21   9   5}           13.8         10.8 s      -4.71

Table  5.4.  Comparison  between  different  sets  of  coarse-grid  tolerances  in  terms  of 
the  effort  expended  in  the  FMG  procedure,  for  the  Re  =  300  symmetric  backward- 
facing  step  flow  and  the  bilinear  interpolation  prolongation  procedure.  Second-order 
upwinding  is  used  on  all  grid  levels. 
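
One way to read Table 5.4 is to ask how much initial fine-grid residual reduction each strategy buys per second of CM-5 busy time, relative to the cheapest "1-FMG" start. The sketch below uses only the busy times and residual levels listed in Table 5.4. The metric itself, the names `TABLE_5_4` and `gain_per_second`, and the reading of the residual levels as base-10 logarithms are illustrative assumptions, not quantities defined in this work.

```python
# Rows of Table 5.4: (strategy, CM-5 busy time in s, initial fine-grid u-residual level).
TABLE_5_4 = [
    ("1-FMG cycle", 1.2, -2.88),
    ("T.E. w/denom. = 1", 3.0, -3.82),
    ("T.E. w/denom. = 5", 4.8, -4.63),
    ("-3.0 on all levels", 8.1, -4.37),
    ("-5.0 on all levels", 21.5, -5.10),
    ("Graded tolerances", 10.8, -4.71),
]


def gain_per_second(rows):
    """Extra orders of residual reduction per extra second of busy time,
    measured against the first row (assumed to be the cheapest strategy)."""
    _, base_time, base_level = rows[0]
    gains = {}
    for name, busy_time, level in rows[1:]:
        extra_orders = base_level - level   # levels are assumed to be log10 values
        extra_time = busy_time - base_time
        gains[name] = extra_orders / extra_time
    return gains


if __name__ == "__main__":
    ranked = sorted(gain_per_second(TABLE_5_4).items(), key=lambda kv: -kv[1])
    for name, gain in ranked:
        print(f"{name:20s} {gain:.2f} orders/s")
```

Under this particular metric the two truncation-error criteria give the largest return per second and the constant -5.0 tolerance the smallest, which is in line with the discussion above.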

The results for the defect-correction strategy are summarized in Table 5.5.
In  the  cavity  flow,  the  second-order  upwind  scheme  was  very  difficult  to  converge 
when  a  constant  or  a  graded  tolerance  was  given.  In  the  step  flow,  it  appears  that 
the  defect-correction  strategy  is  harder  to  converge. 


Coarse-grid          Number of V(3,2) cycles   FMG          FMG CM-5    Initial level of
tolerances           on levels {1...4}         work units   busy time   fine-grid u residual

"1-FMG" cycle        {x   1   1   1}           2.2          1.0 s       -2.99
T.E. w/denom. = 1    {x   2   2   2}           4.3          1.9 s       -3.34
T.E. w/denom. = 5    {x   5   6   5}           12.3         5.1 s       -4.18
-3.0 on all levels   {x  22   9   1}           7.3          6.9 s       -4.00
-5.0 on all levels   {x  32  24  53}           100.1        37.8 s      -4.24
Graded tolerances    {x  21  12  21}           41.4         17.2 s      -4.22

Table  5.5.  Comparison  between  different  sets  of  coarse-grid  tolerances  in  terms  of  the 
effort  expended  in  the  FMG  procedure,  for  the  Re  =  300  symmetric  backward-facing 
step  flow  and  the  bilinear  interpolation  prolongation  procedure.  The  defect-correction 
stabilization  strategy  is  used. 


5.2.3     Influence  of  Initial  Guess  on  Convergence  Rate 

The cost/initial accuracy tradeoff was discussed above. In addition, the initial
guess on the fine grid is important because it can affect the asymptotic convergence
rate and stability of subsequent fine-grid cycles. In many cases this consideration is
more important than the cost/initial accuracy tradeoff, since the time spent in the
FMG procedure may be very small compared to the overall time required if many
fine-grid cycles are needed. The FMG contribution to the total run time, especially
on the CM-5, is not always negligible, though, in particular if one defines convergence
according to the truncation error estimate on the finest grid, i.e. differential
convergence, as suggested by Brandt and Ta'asan [7].

Figure  5.13  gives  the  convergence  path  for  the  entire  computation  for  the  lid- 
driven  cavity  flow.  In  the  top  plot,  the  fine-grid  average  u  residual  is  plotted  against 
the  CM-5  busy  time  for  the  defect-correction  scheme.  The  defect-correction  scheme 
and  second-order  upwind  scheme  (bottom  plot)  converge  at  nearly  the  same  rate.  The 
differences  in  the  initial  fine-grid  residual  level  due  to  the  FMG  procedure  evidently 
do  not  persist  for  very  long,  and  if  the  purpose  is  to  obtain  algebraic  convergence, 
Eq. 5.4, then the difference in CM-5 busy time due to the FMG procedure is insignificant.
However, if convergence is declared when the average u residual falls beneath
the  dotted  line,  the  estimated  truncation  error  level  on  the  fine  grid,  then  the  FMG 
procedure  contributes  anywhere  from  10%  of  the  total  time,  in  the  case  of  the  trun- 
cation error  criterion  with  denominator  1,  to  80%  of  the  total  time,  in  the  case  of  the 
constant  -5.0  criterion. 

For the Re = 5000 lid-driven cavity flow, using SIMPLE with $\nu_u = \nu_v = 1$ and
$\nu_c = 4$ inner SLUR iterations and a W(1,1) multigrid cycle, Sockol [80] reported that
86 work units and 800 seconds on an Amdahl 5980 were needed to reach convergence.
To reach a similar convergence tolerance the present computation needed 200 work
units and 64 seconds on the CM-5. In the previous section, the amount of smoothing
used in the present case, V(3,2) cycles, was observed to be somewhat more than was
necessary for this flow problem. The difference between V(3,2) cycles and W(1,1)
cycles in terms of work units is approximately 3 per cycle. Thirty cycles on the
fine grid were taken in the present case. Thus, it seems that the present result is
comparable to Sockol's result.
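
The comparison can be made explicit with a short calculation. This is only one reading of the argument: it assumes that each of the 30 fine-grid cycles would cost about 3 work units less with W(1,1) cycling and that the cycle count would stay the same. The function name `adjusted_work_units` is illustrative.

```python
def adjusted_work_units(total_work_units, fine_grid_cycles, saving_per_cycle):
    """Work units left after removing a fixed per-cycle saving from every cycle."""
    return total_work_units - fine_grid_cycles * saving_per_cycle


if __name__ == "__main__":
    # 200 work units and 30 fine-grid cycles are quoted in the text;
    # 3 work units per cycle is the stated V(3,2) versus W(1,1) difference.
    present = adjusted_work_units(200.0, 30, 3.0)
    print(present)          # 110.0
    print(present / 86.0)   # ratio to the 86 work units reported by Sockol
```

With these assumptions the present computation corresponds to roughly 110 work units, the same order as the 86 reported by Sockol.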

The  fine-grid  convergence  paths  for  the  symmetric  backward-facing  step  flow,  Fig- 
ure 5.14,  are  very  interesting.  The  second-order  upwind  scheme  performs  remarkably 
well.  The  average  u  residual  reaches  -8.0  in  just  slightly  more  than  20  seconds  on 
the  CM-5  and  140  work  units  (20  V(3,2)  cycles  on  the  321  x  81  fine  grid).  This 
convergence rate corresponds to an amplification factor of 0.6 per cycle for the $L_1$
norm of the u-residual. Because of the fast convergence rate the contribution of the
startup  FMG  cycling  is  a  significant  fraction  of  the  overall  parallel  run  time. 
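
The relation between a per-cycle amplification factor and the drop in the residual level can be sketched as follows. The sketch assumes that the residual levels quoted in this chapter are base-10 logarithms and that the amplification factor is constant from cycle to cycle; the function names are illustrative.

```python
import math


def orders_reduced(amplification, cycles):
    """Orders of magnitude by which the residual norm drops after `cycles`
    cycles with a constant amplification factor per cycle."""
    return -cycles * math.log10(amplification)


def amplification_factor(start_level, end_level, cycles):
    """Average per-cycle amplification factor implied by two log10 residual levels."""
    return 10.0 ** ((end_level - start_level) / cycles)


if __name__ == "__main__":
    print(f"{orders_reduced(0.6, 20):.2f} orders in 20 cycles at 0.6 per cycle")
    print(f"{amplification_factor(-3.56, -8.0, 20):.2f} per cycle from -3.56 to -8.0")
```

A factor of 0.6 sustained over 20 cycles lowers the residual by about 4.4 orders of magnitude, so reaching -8.0 in 20 cycles would imply a starting level near -3.6 under these assumptions.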

The  defect-correction  strategy  does  not  converge  as  quickly  as  the  second-order 
upwind  scheme  in  the  symmetric  backward-facing  step  flow.  Furthermore,  for  the 
defect-correction  scheme,  the  fine-grid  initial  guess  evidently  affects  the  rate  of  con- 
vergence. To  obtain  the  convergence  paths  in  the  top  plot  of  Figure  5.14,  identical 
procedures  and  parameters  were  used  for  the  multigrid  iterations  beginning  on  the 
fine grid. The relaxation factors were $\omega_{uv} = \omega_c = 0.5$ and fixed V(3,2) cycles were
used. 

The coarse-grid discretizations in the FMG procedure use first-order upwinding,
while the fine-grid discretization is modified to produce central-difference accuracy.
Thus, the sudden rise in the residual level for all cases (except the truncation error
criterion with denominator equal to 1) suggests that the first-order upwind and
central-difference solutions to this flow problem are very different. It is apparently
difficult for the numerical method to evolve the solution from first-order upwind
accuracy into central-difference accuracy. Thus, there is actually an advantage in not
converging the coarse grids to tight tolerances. On the other hand, the "1-FMG"
procedure has the worst convergence rate of the cases considered. The conclusion
that Figure 5.14 supports is that there is an optimal solution accuracy on the coarse grids
in the FMG procedure, which is related to the differential error in the solution since
the truncation error estimate gives the best result.

5.2.4     Remarks 

Both  flow  problems  have  strong  nonlinearities  and  are  relatively  difficult  and 
slow  to  converge  as  single-grid  computations.  The  multigrid  method  allows  larger 
relaxation  parameters  to  be  used.  Very  fast  convergence  rates  can  be  obtained, 
but  the  performance  depends  on  the  discretization  on  coarse  grids  (the  stabilization 
strategy) and the initial fine-grid guess. The truncation error criterion
gives the best results in both flow problems, and the initial residuals on both the
fine grid and the coarse grids are relatively independent of how tightly the
coarse grids are converged. These facts indicate that only a certain amount of accuracy can be
obtained initially for a given flow problem and coarse-grid discretization scheme,
and that this limit is essentially a reflection of the truncation error of the
discretization.

The second-order upwind scheme may be prone to large source terms which can
cause the multigrid iterations to diverge, especially if relatively few smoothing iterations
are used. This observation was made for the cavity flow. On the other hand,
when there is a significant difference between the first-order and central-difference
solutions on a given grid, the success of the defect-correction strategy depends strongly
on the initial guess on the finest grid (see the step flow results) and, in this sense,
the defect-correction approach is not very robust.

The stability of multigrid iterations is different from that of single-grid calculations,
and certainly more confusing. For example, if a single-grid calculation does not converge
at a given Reynolds number with a certain set of relaxation parameters, then
reducing the relaxation factors is always convergence-enhancing. For multigrid iterations
this is not necessarily true. For the second-order upwind scheme
in the Re = 300 symmetric backward-facing step flow, it was observed that the single-grid method
diverges using $\omega_{uv} = 0.3$ with $\omega_c = 0.2$. However, convergence was obtained
with $\omega_{uv} = 0.6$ and $\omega_c = 0.4$. Evidently, there is a certain minimum amount of
smoothing required. The amount depends on the flow problem as well as the restriction
and prolongation procedures. In other words, reducing the relaxation factors to
cope with problems that have strong nonlinearities may simultaneously require increasing
the number of smoothing iterations on each level. The converse is also true,
although perhaps counterintuitive: reducing the amount of smoothing, for example
from V(3,2) to V(2,1) cycles, may cause stability problems. Increasing the relaxation
factors is the appropriate response. By contrast, for single-grid computations, if the
number of inner iterations is too low, the relaxation factors are decreased to avoid
divergence. Additional testing in the smoothing/relaxation factor parameter space
would be desirable to further clarify this point.

5.3     Performance  on  the  CM-5 

This  section  quantifies  the  cost  of  multigrid  cycling  on  the  CM-5,  and  discusses 
the  efficiency  and  scalability  of  the  present  algorithm  and  implementation.  In  other 
words, to connect with the preceding section, once the fine grid is reached, what is
the  best  grid  schedule  to  use,  how  long  does  each  cycle  take,  and  how  does  this  cost 
scale  with  the  problem  size  and  the  number  of  processors? 

In  Figure  5.15,  the  costs  of  smoothing  and  prolongation  are  shown  as  a  function 
of  problem  size,  for  a  32-node  CM-5  and  a  512-node  CM-5.  During  a  multigrid 
cycle these costs are incurred for each grid level. In a V(3,2) cycle, for example, 5
SIMPLE iterations are done at every grid level, along with one restriction from and
one prolongation to every grid level except the coarsest. If the finest grid is 770 x 770,
then on a 32-node CM-5 the subgrid size (VP) is roughly 4800. The next (coarser)
grid  is  385  x  385  and  has  a  subgrid  size  of  1225.  Thus  in  a  two-level  V(3,2)  cycle, 
the  total  time  is  the  sum  of  5  SIMPLE  iterations  at  VP  =  4800,  one  restriction  from 
VP  =  4800  to  VP  =  1225,  5  SIMPLE  iterations  at  VP  =  1225,  and  one  prolongation 
from  VP  =  1225  to  VP  =  4800.  Thus,  Figure  5.15  is  a  level-by-level  breakdown  of 
the  parallel  run  time  used  by  the  smoothing  and  prolongation  multigrid  components. 
The  times  plotted  are  total  elapsed  times  including  the  processor  idle  time  due  to 
front-end  work. 

The  smoothing  cost  dominates  the  cost  of  the  prolongation,  at  every  VP.  Thus 
unless  a  multigrid  cycle  with  less  smoothing  is  used,  the  common  idealization  that  the 
restriction  and  prolongation  costs  are  negligible  on  serial  computers  also  holds  true  on 
the CM-5. The restriction cost has not been shown in order to keep the figure clear.
It follows the same trend as prolongation, but is slightly less time-consuming if the
residuals alone are restricted (about 25% less), and slightly more time-consuming
if both solutions and residuals are restricted.

The trend is linear for restriction, prolongation, and smoothing. When residuals
only are restricted, the ratio of the times for these three components tends
toward 1:2:13, on the 32-node CM-5, as the number of grid points increases (i.e. as
the subgrid size increases).

However,  for  the  512-node  CM-5,  the  time  taken  by  prolongation  grows  at  a 
slightly  greater  rate  than  on  the  32-node  computer.  On  the  512-node  CM-5,  VP  = 
4800  corresponds  to  a  3080  x  3080  grid  size,  instead  of  770  x  770  as  was  the  case 
with the 32-node CM-5. Apparently, the global communication patterns needed to
accomplish the prolongation are not perfectly scalable on the fat-tree, at least with
the current CM-Fortran implementation.

Figure  5.15  gives  the  impression  that  the  cost  of  SIMPLE  iterations  varies  linearly 
with  VP.  However,  as  shown  in  Figure  5.16,  the  variation  is  not  actually  linear  for 
very  small  VP.  The  bar  on  the  left  is  the  CM-5  busy  time  for  5  SIMPLE  iterations, 
given  as  a  function  of  the  grid  level.  The  bar  on  the  right  is  the  corresponding  CM-5 
elapsed  time,  taken  from  data  points  along  the  smoothing  cost  curve  in  Figure  5.15. 
The  busy  time  records  the  time  spent  doing  parallel  computation  and  interprocessor 
communication  operations.  These  operations  are  very  inefficient  at  small  VP  on  the 
CM-5  because  the  vector  units  are  not  fully  loaded.  Thus,  the  busy  time  does  not 
scale  linearly  with  the  subgrid  size  for  small  VP  because  the  efficiency  of  vectorized 
computation  and  interprocessor  communication  increases  as  the  subgrid  size  grows. 
Note  however  that  the  busy  time  is  always  a  monotonic  function  of  VP. 

The  variation  of  elapsed  time  by  contrast  stays  approximately  constant  until  level 
5  of  this  sample  multigrid  cycle.  Level  5  corresponds  to  VP  =  36  on  the  32-node 
CM-5.  The  elapsed  time  includes  the  idle-processor  time  due  to  front-end  work.  As 
discussed  in  Chapter  3,  there  are  several  overhead  costs  of  parallel  computation  and 
interprocessor  communication.  These  operations  may  leave  the  CM-5  vector  units 
inactive  for  short  periods  of  time.  For  small  VP  the  dominant  consideration  in  this 
regard  is  the  passing  of  code  blocks,  i.e.  the  front-end-to-processor  communication. 
This cost stays constant with VP, as shown by the elapsed time at small VP
in Figure 5.16. The elapsed time is actually larger for VP = 1 than VP = 2. This
observation is reproducible but its cause is not fully understood. Inaccurate timings
may be the problem. A computer with a relatively fast front-end and communication
network performs closer to the ideal for small VP.

Since the cost of smoothing on the coarse grids does not go to zero as $VP \to 0$, the
possibility exists for coarse grids to make a nonnegligible contribution to the parallel run
time, if the cycling scheme is such that the coarse grids are visited more frequently
than the fine grids. Figures 5.17-5.18 illustrate this point clearly. The cost per
multigrid cycle is compared between V and W cycling strategies. Specifically, V(3,2)
cycles  are  compared  against  W(3,2)  cycles.  The  timings  are  obtained  on  a  32-node 
CM-5.  The  number  of  levels  is  fixed  as  the  finest  grid  dimensions  increase.  Both 
elapsed  and  busy  times  are  plotted. 

The total time per cycle includes the cost of smoothing on the grid levels involved,
the restriction and prolongation costs, and the cost of program control and
input/output. For a V cycle, this time can be modelled as

$$\frac{\text{Time}}{\text{V cycle}} = \sum_{k=1}^{n_{level}} s_k (n_{pre} + n_{post}) + \sum_{k=2}^{n_{level}} (r_k + p_k), \qquad (5.20)$$

where $s_k$, $r_k$, and $p_k$ are the smoothing time per iteration on level k (from Figure
5.15), the restriction time from level k, and the prolongation time to level k.
The number of levels is $n_{level}$, and $n_{pre}$ and $n_{post}$ represent the number of pre- and
post-smoothing iterations, in this case 3 and 2, respectively. In contrast, W cycles
visit the coarse grids much more frequently. Their time per cycle can be modelled as

$$\frac{\text{Time}}{\text{W cycle}} = \sum_{k=1}^{n_{level}} s_k (n_{pre} + n_{post})\, 2^{(n_{level}-k)} + \sum_{k=2}^{n_{level}} (r_k + p_k)\, 2^{(n_{level}-k)}. \qquad (5.21)$$

These expressions are valid for serial computations, too. On serial computers, the
restriction and prolongation costs are generally negligible, and the smoothing cost
per level $s_k$ decreases by roughly a factor of 4 with each lower (coarser) grid level. For
parallel computation on the CM-5, the fact that $s_k$ remains approximately constant
for the coarsest grids is a problem when many multigrid levels are used.
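
The two cost models can be compared numerically with a short sketch. It assumes that level k is visited $2^{(n_{level}-k)}$ times per W cycle, with level 1 the coarsest grid. The smoothing-cost profiles are illustrative only: `smoothing_costs` and its `floor` argument (a constant per-iteration overhead on the coarsest grids) are hypothetical stand-ins for the measured curves of Figures 5.15 and 5.16, not data from this work.

```python
def v_cycle_time(s, r, p, n_pre=3, n_post=2):
    """V-cycle time. s, r, p are per-level costs with index 0 the coarsest grid;
    r[0] and p[0] are unused because no transfer goes below the coarsest level."""
    n_level = len(s)
    smoothing = sum(s[k] * (n_pre + n_post) for k in range(n_level))
    transfers = sum(r[k] + p[k] for k in range(1, n_level))
    return smoothing + transfers


def w_cycle_time(s, r, p, n_pre=3, n_post=2):
    """W-cycle time, assuming the level at index k is visited 2**(n_level-1-k) times."""
    n_level = len(s)
    visits = [2 ** (n_level - 1 - k) for k in range(n_level)]
    smoothing = sum(s[k] * (n_pre + n_post) * visits[k] for k in range(n_level))
    transfers = sum((r[k] + p[k]) * visits[k] for k in range(1, n_level))
    return smoothing + transfers


def smoothing_costs(n_level, floor=0.0):
    """Hypothetical cost profile: finest-grid cost normalized to 1, reduced by a
    factor of 4 per coarser level, but never below `floor`."""
    return [max(4.0 ** (k - (n_level - 1)), floor) for k in range(n_level)]


if __name__ == "__main__":
    for n_level in (3, 7):
        zeros = [0.0] * n_level  # intergrid transfers neglected in this comparison
        for label, floor in (("serial-like", 0.0), ("with cost floor", 0.05)):
            s = smoothing_costs(n_level, floor)
            ratio = w_cycle_time(s, zeros, zeros) / v_cycle_time(s, zeros, zeros)
            print(f"{n_level} levels, {label}: W/V cost ratio = {ratio:.2f}")
```

With the factor-of-4 profile the W/V cost ratio stays modest as levels are added, whereas any nonzero floor makes the ratio grow quickly with the number of levels, which is the behavior described for the CM-5.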

When  only  three  levels  are  involved,  there  is  very  little  disadvantage  to  using  W 
cycles, as shown in Figure 5.17. Since it is usually possible to gain some benefit to the
convergence rate by more frequent coarse-grid corrections, W cycles are recommended
on  the  CM-5  if  the  number  of  multigrid  levels  is  small.  However,  for  5  or  more  levels, 
Figures  5.18  and  5.19,  W  cycles  begin  to  cost  more  than  they  are  worth  in  terms  of 
improved  convergence  rates.  Also,  since  there  is  a  greater  difference  between  V  and 
W  cycle  elapsed  and  busy  times  as  more  multigrid  levels  are  added,  reflecting  the 
relatively  larger  idle  times  for  coarse  grids  (recall  Figure  5.16),  the  parallel  efficiency 
of  W  cycles  is  less  than  that  of  V  cycles. 

In  the  present  work  V  cycles  have  been  sufficient  to  achieve  good  convergence 
rates  so  no  comparisons  have  been  made  to  W  cycles.  Such  studies  need  to  be  made, 
but  on  a  problem-by-problem  basis.  For  the  symmetric  backward-facing  step  flow 
and  lid-driven  cavity  flow,  it  is  not  expected  that  W  cycles  will  be  advantageous. 

In many cases it is acceptable and even beneficial to use fewer than the full complement
of multigrid levels, i.e. to increase the problem size keeping the number of levels
fixed. Whether or not the computation is for a physically time-dependent flow problem,
there exists an implied time-step in iterative numerical techniques. In multigrid
computations, the changes in the evolving solution on coarser grid levels are smaller,
reflecting the fact that the physical or pseudo-physical development of the solution
on the fine grid is occurring on a much smaller scale. Thus, the coarsest grid levels
may be truncated without deteriorating the convergence rate. Pressure needs to be
treated globally, but usually there are enough multigrid cycles taken to ensure that
slow development of the pressure field is not a problem, even when the coarsest grid
level is not very coarse.

Figures  5.20-5.22  integrate  the  information  contained  in  the  preceding  figures.  In 
Figure  5.20,  the  variation  of  parallel  efficiency  of  7-level  V(3,2)  cycles  with  problem 
size is summarized. The problem size is the virtual processor ratio VP of the finest
grid level, but of course during the multigrid cycle operations are being done on
coarser grids, too, where VP is smaller.

Figure  5.20  is  similar  to  Figure  3.2  obtained  using  the  single-grid  pressure-correction 
algorithm.  For  small  VP  the  useful  work  (the  computation)  is  dominated  by  the 
interprocessor  and  front-end-to-processor  communication,  resulting  in  low  parallel 
efficiencies.  The  efficiency  rises  as  the  time  spent  in  computation  increases  relative 
to  the  overhead  costs.  The  highest  efficiency  obtained  is  almost  0.65  compared  to  0.8 
for  the  single-grid  method  on  the  CM-5.  The  burden  of  additional  program  control, 
relatively  more  expensive  coarse-grid  smoothing,  and  the  restriction  and  prolongation 
tasks,  adds  up  to  0.15  in  terms  of  the  parallel  efficiency. 

Unlike  the  single-grid  case  however,  the  efficiency  does  not  peak  for  large  problem 
sizes.  The  contributions  from  the  less-efficient  coarser  grids  in  a  multigrid  cycle  on  the 
CM-5 are significant, even when the finest grid has VP ~ 8k. The subgrid
sizes in a 7-level multigrid cycle (a realistic cycle) span three orders
of magnitude. Unfortunately, the range of VP in which the multigrid smoother
achieves  high  parallel  efficiencies  is  not  as  broad.  In  this  regard  the  performance 
of  the  multigrid  method  on  the  MasPar-style  of  SIMD  computers  is  expected  to 
be  much  better  since  the  single-grid  method  achieved  high  parallel  efficiencies  for 
VP  >  32  all  the  way  up  to  the  largest  problem  size.  Numerical  experiments  have 
not  been  conducted  to  study  the  multigrid  method  on  MasPar  SIMD  computers, 
however,  because  their  Fortran  compiler  is  not  yet  sufficiently  developed  to  address 
the  storage  problem. 
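
The spread of subgrid sizes within one cycle can be checked with a few lines. The sketch assumes that each coarsening halves the grid in both directions, so VP drops by a factor of about 4 per level, and it takes "8k" to mean 8192; both are simplifying assumptions, and the function name is illustrative.

```python
import math


def subgrid_sizes(vp_finest, n_levels, coarsening=4):
    """Approximate subgrid size (VP) on each level, finest first."""
    return [vp_finest / coarsening ** j for j in range(n_levels)]


if __name__ == "__main__":
    sizes = subgrid_sizes(8192, 7)
    print(sizes)
    print(f"{math.log10(sizes[0] / sizes[-1]):.1f} orders of magnitude")
```

Seven levels take VP from about 8192 on the finest grid down to about 2 on the coarsest, a ratio of 4096, so most of the levels in the cycle sit well below the range where the smoother is efficient.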

The efficiency in Figure 5.20 apparently has a small dependence on the number of processors.
This dependence is clearly shown in the next figure, Figure 5.21. The dependence
is due to the slightly increased time spent in intergrid transfer operations
with increasing $n_p$, observed earlier in Figure 5.15. Figure 5.21 shows the decrease in
efficiency with increasing number of processors for five different subgrid sizes. Again
recall that the subgrid size is for the finest grid but that much coarser grids are involved
in the 7-level V(3,2) cycles. The figure indicates that the rate of decrease in
efficiency is the same for every VP down to at least VP = 320.

The  dashed  lines  are  linear  least-squares  curve  fits  to  the  data.  The  data  points 
are  perturbed  about  these  lines  due  to  variations  in  the  elapsed  parallel  run  time  Tp. 
Tp  varies  slightly  from  timing  to  timing  depending  on  the  workload  of  the  front-end 
machine.  In  all  cases  multiple  timings  were  obtained  as  a  check  on  the  reproducibility. 
In  light  front-end  loadings  (i.e.  the  middle  of  the  night),  the  measured  Tp  did  not  vary 
more  than  +/-20%. 

Figure 5.22 combines the information contained in Figures 5.20 and 5.21. As
in the single-grid case, Figure 3.6, curves of constant efficiency are drawn on a plot
of problem size versus the number of processors. The curves are constructed by
interpolating in Figure 5.21, using the dashed lines as the data instead of the actual
data points, to determine VP at a given $(E, n_p)$ intersection. N is computed from
the definition of VP, i.e. $N = n_p VP$.

The isoefficiency curves are almost linear or, in other words, the 7-level multigrid
algorithm, analyzed on a per-cycle basis, is almost scalable. Each of the isoefficiency
curves can be accommodated by an expression of the form

$$N - N_0 = \text{constant}\,(n_p - 32)^{\alpha}, \qquad (5.22)$$

with $\alpha \approx 1.1$. The symbol $N_0$ is the initial problem size needed to obtain a particular
E on 32 processors.
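
What an exponent of about 1.1 means in practice can be illustrated with Eq. 5.22 directly. In the sketch below only the functional form and the exponent come from the text; the values passed for `n0` and `c` are placeholders chosen for illustration, not fitted constants from Figure 5.22.

```python
def isoefficiency_problem_size(n_p, n0, c, alpha=1.1):
    """Problem size N needed on n_p processors to hold the efficiency fixed,
    following the form N - N0 = c * (n_p - 32)**alpha."""
    return n0 + c * (n_p - 32) ** alpha


if __name__ == "__main__":
    # Growth of (N - N0) each time (n_p - 32) doubles; 2.0 would be exactly linear.
    print(f"growth per doubling: {2 ** 1.1:.3f}")
    # Placeholder constants, for illustration only.
    for n_p in (32, 64, 128, 256, 512):
        n = isoefficiency_problem_size(n_p, n0=1.0e4, c=1.0e3)
        print(f"n_p = {n_p:4d}  N = {n:12.0f}")
```

Each doubling of $(n_p - 32)$ requires $N - N_0$ to grow by a factor of about 2.14 rather than exactly 2, which is the sense in which the algorithm is almost, but not perfectly, scalable.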

Along the isoefficiency curves, "scaled-speedup" [35] is nearly achieved. If the
parallel run time Tp at the initial problem size is acceptable, then it can be maintained
with the present pressure-based multigrid method as the problem size and the number
of processors are increased in proportion. The inner iterations must be point-Jacobi,
of course, since the line-iterative method is $O(N \log_2 N)$. With the line-iterative
method Tp increases slightly along the isoefficiency curves. The scalability should
be nearly the same though, since nearest-neighbor communications dominate in the
cyclic reduction parallel algorithm due to the data mapping used on the CM-5.

5.4     Concluding  Remarks 

A parallel multigrid algorithm has been formulated and implemented on the CM-5.
The focus of numerical experiments and timings has been on the potential of
this  approach  for  the  purpose  of  achieving  scalable  parallel  computing  techniques  for 
application  to  the  incompressible  Navier-Stokes  equations. 

The  results  obtained  indicate  that  the  efficiency  of  the  parallel  implementation 
of  the  nonlinear  pressure-based  multigrid  method  approaches  0.65  for  large  problem 
sizes,  and  is  almost  linearly  scalable  on  the  CM-5.  The  cost  per  V(3,2)  cycle  is  about 
1.5 s on a 128-vector-unit CM-5 for a 7-level problem with a 321 x 321 fine grid. The
cost  per  iteration  is  dominated  by  the  smoothing  cost,  and  thus  much  attention  has 
been  given  to  the  details  of  the  implementation  and  performance  of  the  single-grid 
pressure-based  method  on  SIMD  computers.  Restriction  and  prolongation  are  almost 
negligible,  although  they  are  responsible  for  the  deviation  from  linear  computational 
scalability  observed  in  Figure  5.22.  Very  large  problem  sizes  can  be  handled  on  the 
CM-5, up to 3074 x 3074 on a 32-node machine, provided the storage problem for
Fortran  multigrid  implementations  can  be  resolved. 

The speed of the multigrid code was not assessed directly, but reasonable estimates
can be made based on the single-grid performance. For the single-grid SIMPLE
method using the point-Jacobi solver, 417 MFlops was achieved on a 32-node (128
VU) CM-5. Since the multigrid cost per 7-level cycle is dominated by the smoothing
costs and the multigrid efficiency is 0.65 compared to 0.8 (about a 20% decrease), the
speed is roughly 333 MFlops. Slightly improved efficiency and speed can be obtained
with fewer multigrid levels. For unsteady flow calculations multigrid cycles with a
small number of levels may perform reasonably well. This should be investigated.
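
The speed estimate above is a proportional scaling, and a short calculation makes the
rounding visible. The sketch below assumes that sustained speed scales linearly with
parallel efficiency, which is the reasoning implied by the text; the function name is
illustrative only.

```python
def scaled_speed(single_grid_mflops, single_grid_eff, multigrid_eff):
    """Scale a measured speed by the ratio of parallel efficiencies.

    Assumes speed is proportional to efficiency (an assumption of this sketch).
    """
    return single_grid_mflops * multigrid_eff / single_grid_eff


# Values quoted in the text: 417 MFlops at efficiency 0.8; multigrid efficiency 0.65.
exact_ratio = scaled_speed(417.0, 0.8, 0.65)  # uses 0.65 / 0.8 = 0.8125
rounded = 417.0 * (1.0 - 0.20)                # uses the rounded "20% decrease"
print(f"exact ratio: {exact_ratio:.1f} MFlops, rounded 20%: {rounded:.1f} MFlops")
```

With the unrounded efficiency ratio the estimate is slightly higher than the quoted
333 MFlops, but both values lie between 330 and 340 MFlops.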

Several practical recommendations have been made regarding multigrid techniques
for parallel computation. V cycles should be used unless the number of
multigrid levels is small. W cycles are too expensive because of the nonnegligible
coarse-grid smoothing costs. The FMG procedure should be controlled by
the truncation error estimate, Eq. 5.16. The FMG procedure can affect not only
the time needed to reach the fine grid, but also the asymptotic convergence rate
and stability of the multigrid iterations, as is evident from Figure 5.14.
This observation may not carry over to the locally-coupled explicit
smoother, which should be tested in the same way. In terms of computational efficiency
the locally-coupled explicit method has nearly the same properties on the CM-5 as
the pressure-correction method, although the influence of the coefficient
computations on the cost per iteration and efficiency is greater.
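
The reason W cycles become expensive when coarse-grid smoothing is not cheap can be
seen by counting how often each level is visited. The sketch below is illustrative
only: it assumes the textbook recursive definition of the cycle (one recursion per
level for V, two for W, applied at every level) and 3 + 2 = 5 smoothing sweeps per
visit, including on the coarsest grid. The function names are hypothetical.

```python
def level_visits(num_levels, gamma):
    """Count how often each grid level is visited in one multigrid cycle.

    gamma = 1 gives a V cycle, gamma = 2 a W cycle. Level num_levels is the
    finest grid and level 1 the coarsest. Returns {level: visits}.
    """
    visits = {level: 0 for level in range(1, num_levels + 1)}

    def cycle(level):
        visits[level] += 1
        if level > 1:
            for _ in range(gamma):  # recurse gamma times on the next coarser grid
                cycle(level - 1)

    cycle(num_levels)
    return visits


def smoothing_sweeps(num_levels, gamma, sweeps_per_visit=5):
    """Total smoothing sweeps per level, assuming 5 sweeps at every visit."""
    counts = level_visits(num_levels, gamma)
    return {level: n * sweeps_per_visit for level, n in counts.items()}


v = smoothing_sweeps(7, gamma=1)
w = smoothing_sweeps(7, gamma=2)
for level in range(7, 0, -1):
    print(f"level {level}: V sweeps = {v[level]:3d}, W sweeps = {w[level]:3d}")
```

Under these assumptions a 7-level W cycle performs 64 times as many sweeps on the
coarsest grid as a V cycle does. On a serial computer those sweeps are cheap because
the coarse grids have few points; when the cost of a coarse-grid sweep does not shrink
in proportion to the grid size, they are not.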

Several algorithmic factors have been studied. In particular, the coarse-grid
discretization (the stabilization strategy) and the restriction procedure are
observed to be important to the multigrid convergence rate. It appears that the use
of second-order upwinding on all grid levels together with restriction procedure 3,
summing the residuals but not restricting the solutions, provides a very effective
approach for both the symmetric backward-facing step flow and the lid-driven cavity
flow. Smoothing rates per V(3,2) cycle of 0.6 can be maintained until the residual
is driven down to the level of the roundoff error. The convergence rate with
cell-face averaging for the restriction of solutions and residuals was considerably
slower. Similar results were obtained for the cavity flow.
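
To make "summing the residuals" concrete, the following is a minimal sketch of a
summation restriction. It is not the staggered-grid implementation used in this work:
it assumes a cell-centered quantity, 2:1 coarsening, and even grid dimensions, so that
each coarse control volume is the union of four fine ones. The function name and the
sample values are hypothetical.

```python
def restrict_residual_by_summation(fine):
    """Sum each 2x2 block of fine-grid residuals into one coarse-grid residual.

    `fine` is a list of rows with even dimensions. Assumes cell-centered
    control volumes and 2:1 coarsening (an assumption of this sketch).
    """
    rows, cols = len(fine), len(fine[0])
    if rows % 2 or cols % 2:
        raise ValueError("fine grid dimensions must be even")
    return [
        [fine[i][j] + fine[i][j + 1] + fine[i + 1][j] + fine[i + 1][j + 1]
         for j in range(0, cols, 2)]
        for i in range(0, rows, 2)
    ]


# Hypothetical 4 x 4 fine-grid residual field.
fine_residual = [
    [1.0, 2.0, 0.0, 0.0],
    [3.0, 4.0, 0.0, 1.0],
    [0.5, 0.5, 2.0, 2.0],
    [0.5, 0.5, 2.0, 2.0],
]
coarse_residual = restrict_residual_by_summation(fine_residual)
print(coarse_residual)
```

Summation leaves the total residual over the domain unchanged between levels, whereas
averaging rescales it.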

In terms of the coarse-grid discretization strategy, it appears that the popular
defect-correction approach may not be as robust as the second-order upwinding
strategy, at least for entering-type flow problems. In these types of flows, i.e.,
problems with inflow and outflow, the proper formulation of the numerical method (the
pressure-correction smoother) is critical for obtaining good convergence rates. Global
mass conservation must be explicitly enforced during the course of iterations. Global
mass conservation ensures that the system of pressure-correction equations has a
solution, which is identified as an important prerequisite for obtaining reasonable
convergence rates in open boundary problems. The well-posed numerical problem
does not distinguish between inflow and outflow at the open boundary: if the
numerical treatment of the open boundary condition is reasonable and can induce
convergence, the finite-volume staggered-grid pressure-correction method can obtain
the correct numerical solution even if inflow occurs at a nominally outflow boundary.
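
One simple way to enforce global mass conservation at an open boundary is sketched
below. It is an illustration, not the implementation used in this work: it assumes
constant density and applies a single uniform velocity increment to every open-boundary
face, which is only one possible additive correction. The function name and the sample
values are hypothetical.

```python
def enforce_global_mass_conservation(u_out, face_areas, rho, inflow_mass_rate):
    """Shift open-boundary velocities so total outflow equals total inflow.

    u_out            normal velocities at the open-boundary faces (positive = outward)
    face_areas       area (length in 2-D) of each boundary face
    rho              constant density
    inflow_mass_rate total mass entering through the inlet per unit time

    A single uniform velocity increment is used (an assumption of this sketch).
    """
    outflow_mass_rate = rho * sum(u * a for u, a in zip(u_out, face_areas))
    imbalance = inflow_mass_rate - outflow_mass_rate
    du = imbalance / (rho * sum(face_areas))
    return [u + du for u in u_out]


# Hypothetical open boundary with reverse flow (negative velocity) on one face.
u_out = [0.9, 1.4, 0.6, -0.5]
areas = [0.25, 0.25, 0.25, 0.25]
corrected = enforce_global_mass_conservation(u_out, areas, rho=1.0, inflow_mass_rate=1.0)
net = 1.0 - sum(u * a for u, a in zip(corrected, areas))
print(corrected, net)
```

After the correction the net mass imbalance vanishes to roundoff, while the last face
still carries flow into the domain, so the correction does not by itself rule out
inflow at a nominally outflow boundary.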

In conclusion, the results of this research indicate that pressure-based multigrid
methods are computationally and numerically scalable algorithms on SIMD computers.
When proper account is taken of the many implementational considerations, high
parallel efficiencies can be achieved and maintained as the number of processors and
the problem size increase. Likewise, the convergence rate dependence on problem
size should be greatly decreased by the multigrid technique. Thus the present
approach is viable for massively-parallel numerical simulations of the incompressible
Navier-Stokes equations, and should be developed further on SIMD computers. The
target machine should have fast nearest-neighbor and front-end-to-processor
communication compared to the speed of computation, so that reasonably high parallel
efficiencies can be obtained at small problem sizes. The knowledge gained and the
implementations developed in this research are immediately useful for exploiting the
current computational capabilities of the CM-5 and MP-2 SIMD computers, and are
practical contributions which will facilitate future research in parallel CFD.

[Figure 5.1 diagram: grid levels from Level 4 (fine grid) down to Level 1 (coarse
grid), showing the "1-FMG" procedure for the V(3,2) cycle. In the legend, a circled 3
denotes 3 smoothing iterations.]

Figure 5.1. Schematic of an FMG V(3,2) multigrid cycle.
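
The order in which the grid levels are visited in such a schematic can be generated
programmatically. The sketch below is an interpretation, not a transcription of the
figure: it assumes that "1-FMG" means one V cycle is performed on each level before the
solution is prolongated to the next finer level. The function names are hypothetical.

```python
def v_cycle_levels(top):
    """Levels visited by one V cycle that starts and ends on level `top`."""
    down = list(range(top, 0, -1))  # top, top-1, ..., 1
    up = list(range(2, top + 1))    # 2, ..., top
    return down + up


def fmg_schedule(num_levels, cycles_per_level=1):
    """Level sequence for an FMG start-up.

    Solve on level 1, then for each finer level prolongate the solution and run
    `cycles_per_level` V cycles from that level (assumed meaning of "1-FMG").
    """
    schedule = [1]  # initial solve on the coarsest grid
    for top in range(2, num_levels + 1):
        for _ in range(cycles_per_level):
            schedule += v_cycle_levels(top)
    return schedule


print(fmg_schedule(4))
```

For the four levels shown in Figure 5.1 this gives a coarsest-grid solve followed by V
cycles of increasing depth, ending with a full 4-level V cycle on the fine grid.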

Re = 5000 Lid-Driven Cavity Flow

Figure 5.2. Streamfunction, vorticity, and pressure contours for Re = 5000 lid-driven
cavity flow, using the 2nd-order upwind convection scheme. The streamfunction
contours are evenly spaced within the recirculation bubbles and in the interior of the
flows, but the spacing differs between the two regions. The actual velocities within
the recirculation regions are relatively weak compared to the core flows.

Re = 300 Symmetric Backward-Facing Step Flow

Figure 5.3. Streamfunction, vorticity, pressure, and velocity component contours for
Re = 300 symmetric backward-facing step flow, using the 2nd-order upwind convection
scheme. The streamfunction contours are evenly spaced within the recirculation
bubbles and in the interior of the flows, but the spacing differs between the two
regions. The actual velocities within the recirculation regions are relatively weak
compared to the core flows.

Initial MG-convergence path
for T.E. criterion w/denom. = 1

Figure 5.4. The convergence path of the u-residual norm during the FMG procedure
for the Re = 5000 lid-driven cavity flow, using the defect-correction stabilization
strategy. The truncation error criterion, with denominator 1, is used to determine
the coarse-grid tolerances. The abscissas plot work units (proportional to a serial
computer's CPU time) and CM-5 busy time.

Initial MG-convergence path
for T.E. criterion w/denom. = 5

Figure 5.5. The convergence path of the u-residual norm during the FMG procedure
for the Re = 5000 lid-driven cavity flow, using the defect-correction stabilization
strategy. The truncation error criterion, with denominator 5, is used to determine
the coarse-grid tolerances. The abscissas plot work units (proportional to a serial
computer's CPU time) and CM-5 busy time.

Initial MG-convergence path
for constant -3.0 tolerances

Figure 5.6. The convergence path of the u-residual norm during the FMG procedure
for the Re = 5000 lid-driven cavity flow, using the defect-correction stabilization
strategy. The coarse-grid convergence criterion is ||r^h|| < -3.0 on every level. The
abscissas plot work units (proportional to a serial computer's CPU time) and CM-5
busy time.

Initial MG-convergence path
for graded tolerances

Figure 5.7. The convergence path of the u-residual norm during the FMG procedure
for the Re = 5000 lid-driven cavity flow, using the defect-correction stabilization
strategy. The coarse-grid convergence criteria are graded. For levels 2-6, ||r^h|| <
-0.7, -1.3, -1.9, -2.5, -3.1. The abscissas plot work units (proportional to a serial
computer's CPU time) and CM-5 busy time.

Initial MG-convergence path
for T.E. criterion w/denom. = 1

Figure 5.8. The convergence path of the u-residual norm during the FMG procedure
for the Re = 300 symmetric backward-facing step flow, with the defect-correction
stabilization strategy. The truncation error criterion, with denominator 1, is applied
to abbreviate coarse-grid multigrid cycling.

Initial MG-convergence path
for T.E. criterion w/denom. = 1

Figure 5.9. The convergence path of the u-residual norm during the FMG procedure
for the Re = 300 symmetric backward-facing step flow, with second-order upwinding
on all levels. The truncation error criterion, with denominator 1, is applied to
abbreviate coarse-grid multigrid cycling.

Initial MG-convergence path
for T.E. criterion w/denom. = 5

[Plot: two panels showing the convergence paths on levels 3, 4, and 5. The abscissas
are Work Units and CM-5 Busy Time (seconds); the ordinate tick labels run from -1 to -6.]

Figure 5.10. The convergence path of the u-residual norm during the FMG procedure
for the Re = 300 symmetric backward-facing step flow, using second-order upwinding
on all levels. The truncation error criterion, with denominator 5, is applied to
abbreviate coarse-grid multigrid cycling.

Initial MG-convergence path
for constant -3.0 tolerances

[Plot: two panels showing the convergence paths on levels 3, 4, and 5. The abscissas
are Work Units and CM-5 Busy Time (seconds).]

Figure 5.11. The convergence path of the u-residual norm during the FMG procedure
for the Re = 300 symmetric backward-facing step flow, using second-order upwinding
on all levels. The coarse-grid convergence criterion is ||r^h|| < -3.0 on every level.

Initial MG-convergence path
for graded tolerances

Figure 5.12. The convergence path of the u-residual norm during the FMG procedure
for the Re = 300 symmetric backward-facing step flow, using second-order upwinding
on all levels. The coarse-grid convergence criteria are graded. For levels 2-4,
||r^h|| < -2.5, -3.1, -3.7.

Re = 5000 MG-Convergence Paths
for different FMG procedures

Figure 5.13. The convergence path of the average u-residual norm on the finest grid
level in the 7-level Re = 5000 lid-driven cavity flow. The relaxation factors used were
ω_uv = ω_c = 0.5.

Re = 300 MG-Convergence Paths
for different FMG procedures

Figure 5.14. The convergence path of the average u-residual norm on the finest grid
level in the 5-level Re = 300 symmetric backward-facing step flow. The relaxation
factors used were ω_uv = 0.6 and ω_c = 0.4.

Relative Cost of Multigrid Components
on 32 and 512 node CM-5s

Figure 5.15. The relative cost of smoothing and prolongation per V-cycle, as a
function of the problem size, for 32- and 512-node CM-5 computers (128 and 2048
processors, respectively). The run times are obtained from V(3,2) cycles, which have
5 smoothing iterations, 1 restriction, and 1 prolongation at each grid level. Elapsed
time (includes front-end-to-processor communication) is plotted. The restriction cost
is slightly less than the prolongation cost when only residuals are restricted, slightly
more when solutions are restricted too, but the trend is the same as for prolongation
and is therefore not shown for clarity.

Figure 5.16. Smoothing cost, in terms of elapsed and busy time on a 32-node CM-5,
as a function of the multigrid level for a case with a 1024 x 1024 fine grid. The
elapsed time is the one on the right (always greater than the busy time). The times
correspond to one SIMPLE iteration.

3 Level V and W-Cycle Times
on a 32-node CM-5

Figure 5.17. Parallel run time, per cycle, on a 32-node CM-5, as a function of the
problem size. V(3,2) cycle cost is compared with W(3,2) cycle cost in terms of total
elapsed time (dashed lines), and busy time (solid lines). As the problem size increases
the number of multigrid levels remains fixed at three.

5 Level V and W-Cycle Times
on a 32-node CM-5

Figure 5.18. Parallel run time, per cycle, on a 32-node CM-5, as a function of the
problem size. V(3,2) cycle cost is compared with W(3,2) cycle cost in terms of total
elapsed time (dashed lines), and busy time (solid lines). As the problem size increases
the number of multigrid levels remains fixed at five.

7 Level V and W-Cycle Times
on a 32-node CM-5

Figure 5.19. Parallel run time, per cycle, on a 32-node CM-5, as a function of the
problem size. V(3,2) cycle cost is compared with W(3,2) cycle cost in terms of total
elapsed time (dashed lines), and busy time (solid lines). As the problem size increases
the number of multigrid levels remains fixed at seven.

Efficiency vs. Problem Size for 7-Level Multigrid
using V(3,2) Cycles

Figure 5.20. Parallel efficiency of the 7-level multigrid algorithm on the CM-5, as a
function of the problem size. Efficiency is determined from Eq. 3.3, where Tp is the
elapsed time for a fixed number of V(3,2) cycles and T1 is the parallel computation
time (Tnode-cpu) multiplied by the number of processors. The trend is the same as
for the single-grid algorithm, indicating the dominant contribution of the smoother
to the overall multigrid cost.
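
The efficiency measure described in this caption can be written out as a short
calculation. The sketch below assumes that Eq. 3.3 has the usual form
E = T1 / (np Tp), with T1 approximated as the per-node computation time multiplied by
the number of processors as the caption states. The function name and the timings are
hypothetical, chosen only to illustrate the arithmetic.

```python
def parallel_efficiency(t_node_cpu, t_elapsed, n_proc):
    """E = T1 / (n_proc * Tp), with T1 approximated as n_proc * t_node_cpu.

    The form of the efficiency definition is an assumption of this sketch.
    """
    t1 = n_proc * t_node_cpu
    return t1 / (n_proc * t_elapsed)


# Hypothetical timings: 6.5 s of computation per node within 10 s of elapsed time.
efficiency = parallel_efficiency(t_node_cpu=6.5, t_elapsed=10.0, n_proc=128)
print(f"efficiency = {efficiency:.2f}")
```

With this approximation for T1 the number of processors cancels, so the efficiency is
the fraction of the elapsed time that the nodes spend computing.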

Efficiency vs. Number of CM-5 Nodes for 7-Level Multigrid
using V(3,2) Cycles

Figure 5.21. Parallel efficiency of the 7-level multigrid algorithm on the CM-5, as
a function of the number of processors, for several problem sizes. Efficiency is
determined from Eq. 3.3, where Tp is the elapsed time for a fixed number of V(3,2)
cycles, and T1 is the parallel computation time (Tnode-cpu) multiplied by the number
of processors. There is only a small fall-off in the efficiency as np increases.

Isoefficiency Curves For 7-Level Multigrid
using V(3,2) Cycles

Figure 5.22. Isoefficiency curves for the 7-level pressure-correction multigrid method,
based on timings of a fixed number of V(3,2) cycles, using point-Jacobi inner
iterations. The plot is constructed based on linear least-squares curve fits of the
data in Figures 5.21 and 5.20. The isoefficiency curves have the general form
N = a np^β + constant, where β ≈ 1.1 for the efficiencies shown.
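
The practical meaning of β ≈ 1.1 can be illustrated with a short calculation. The
sketch below assumes the isoefficiency relation has the form N = a np^β + constant,
with np the number of processors; the coefficient a and the constant are not given
here, so placeholder values (a = 1, constant = 0) are used and only ratios are
examined.

```python
def isoefficiency_problem_size(n_proc, a, beta=1.1, constant=0.0):
    """Problem size N needed to hold the efficiency fixed on n_proc processors.

    Assumes N = a * n_proc**beta + constant; `a` and `constant` are
    placeholders, not fitted values.
    """
    return a * n_proc ** beta + constant


# Growth in N required when the number of processors doubles.
growth = isoefficiency_problem_size(2, a=1.0) / isoefficiency_problem_size(1, a=1.0)

# Growth in points per processor going from 128 to 2048 processors (factor of 16).
per_proc = (isoefficiency_problem_size(2048, a=1.0) / 2048) / (
    isoefficiency_problem_size(128, a=1.0) / 128
)
print(f"N grows by {growth:.3f}x per doubling; points per processor by {per_proc:.3f}x")
```

Perfectly linear scalability would correspond to β = 1, for which both ratios above
reduce to 2 and 1. With β ≈ 1.1 the problem size per processor must grow slowly with
the machine size to hold the efficiency constant.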